Closed jlmarlo closed 3 months ago
Which version of the code are you using? The name thing is a consequence of pathlib, which treats periods as delimiters between names and suffixes. I'll see if I can work around that.
I did intentionally work on simplifying the quality score model a bit, but I may have overshot my target.
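For anyone else hitting the naming issue, here is a minimal sketch of the pathlib behavior described above. It is not the tool's actual code: the file name comes from the report in this thread, and `append_extension` is a hypothetical helper showing one possible workaround.

```python
from pathlib import Path

requested = Path("M11445.SeqErrorModel")

# pathlib treats everything after the last period as the suffix.
print(requested.stem)    # M11445
print(requested.suffix)  # .SeqErrorModel

# with_suffix() replaces that "suffix", so part of the requested name is lost.
truncated = requested.with_suffix(".pickle.gz")
print(truncated.name)    # M11445.pickle.gz


def append_extension(path, extension):
    """Hypothetical workaround: append to the full name instead of replacing the suffix."""
    return path.with_name(path.name + extension)


preserved = append_extension(requested, ".pickle.gz")
print(preserved.name)    # M11445.SeqErrorModel.pickle.gz
```

Appending to `path.name` keeps periods in the user-supplied name, at the cost of never stripping an extension the user typed themselves.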
I believe I'm using version 3.3. The documentation says it's version 3.2, but I cloned everything after the 3.3 update was published.
Oh, okay. Then yeah, let me try to confirm that I can reproduce the behavior you're seeing.
Hello. I just wanted to check up on this issue and see if you were able to replicate the problem or if it is stemming from an error I am making myself.
It most likely is the code, but we've been pushing hard on getting version 4.0 ready. I plan to address this in that release, which I'm hoping will be ready by the end of the year. When you run FastQC on your initial input fastqs, do they look more like you'd expect, with lower quality reads toward the ends?
Yes they look like typical FastQC reports
Okay, thank you. An earlier release may produce better fastqs for the moment. I can make a ticket to fix this for version 3.
We investigated this and found that it was an introduced error on our part. For now, you can use version 2.0, which runs in Python 2. You should be able to install Python 2 using Conda still. I believe the only package required for normal use is numpy. We will be introducing a fix for this as soon as we can.
Thank you for looking into this and giving an alternative for now. I look forward to the newest version of the tool and the new features you're working on. It's one of the most inclusive and easiest genome simulating tools I've worked with and I can't wait for the future of it!
Apologies for the delays. I'm finally getting to this. Could you send me the quality score model that the code generated? Thanks!
Here is the error model, which from my understanding also contains the quality score model: A4416ErrorModel.pickle.gz
I found a couple of issues in our implementation. One is that it was averaging read1 and read2, which was no doubt flattening the profile a bit. I also needed to preserve some other data. I have a fix in mind that I'm currently implementing. I'm going to fix this in version 4.0 for sure, then retrofit that to version 3 if I have time.
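To illustrate why averaging the two mates can flatten the profile, here is a small sketch. The per-position scores are made-up illustrative values, not taken from the attached model, and `drop` is a hypothetical helper.

```python
# Hypothetical per-position mean Phred scores (illustrative values only).
read1 = [38, 38, 37, 36, 34, 32]
read2 = [36, 35, 33, 30, 25, 20]

# Average the two mates position by position, as described above.
averaged = [(a + b) / 2 for a, b in zip(read1, read2)]


def drop(profile):
    """Difference between the first and last position of a quality profile."""
    return profile[0] - profile[-1]


print(drop(read1), drop(read2), drop(averaged))
```

In this toy case the averaged profile declines less than read2 does on its own, so a model built from the average would understate how far read2 quality falls toward the end of the read.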
Hello! I've been using this program to simulate some whole genome sequencing data for horses to evaluate different variant calling and variant filtration methods. So far it's been really easy to use and understand. However, I've been having issues with the Sequencing Error Model. Regardless of what options I use, what sample fastqs I provide, or even if I use the default error models that come in the program files, I can't get any variation in the quality scores. The code that I'm using to generate the error models looks like this:
Then I use the generated model with the following command:
Another thing that may be worth noting is that the name of the sequencing error model appears different in the two commands because the output name for the error model, "M11445.SeqErrorModel", gets truncated to just "M11445" when genSeqErrorModel.py is run, so the corresponding output file is "M11445.pickle.gz". I've seen this happen any time there's a period in the output name.
When running FastQC on the resulting fastqs, I get this distribution, which centers around the correct average from the example FastQC report but does not follow any sort of pattern along the read length.

![image](https://user-images.githubusercontent.com/103147100/197578902-ea640908-4b06-4283-a711-1d92f52a6579.png)
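As a quick cross-check outside FastQC, the per-position mean quality can be computed directly from the FASTQ quality lines. This is a rough sketch that assumes four-line FASTQ records and Phred+33 encoding; `mean_quality_by_position` is a hypothetical helper and the two records are toy data.

```python
import io


def mean_quality_by_position(handle, offset=33):
    """Return the mean Phred score at each read position (assumes 4-line records)."""
    totals = []
    counts = []
    for line_number, line in enumerate(handle):
        # The quality string is the fourth line of each record.
        if line_number % 4 != 3:
            continue
        for position, char in enumerate(line.rstrip("\n")):
            if position == len(totals):
                totals.append(0)
                counts.append(0)
            totals[position] += ord(char) - offset
            counts[position] += 1
    return [total / count for total, count in zip(totals, counts)]


# Toy input: two reads, the second with lower quality toward the end.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nII5+\n")
profile = mean_quality_by_position(example)
print(profile)  # [40.0, 40.0, 30.0, 25.0]
```

A flat list across positions for the simulated reads, next to a declining list for the input reads, would match what the FastQC plot shows.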
Thank you for all of the great work you've done with this tool and I look forward to using it more in the future.