ncsa / NEAT

NEAT (NExt-generation Analysis Toolkit) simulates next-gen sequencing reads and can learn simulation parameters from real data.

MutableSeq Object error #46

Closed ChrisFearn97 closed 2 years ago

ChrisFearn97 commented 2 years ago

I have been receiving an error message when trying to run NEAT to generate artificial sequence data with the gen_reads.py script. Any help would be much appreciated, thanks!

To Reproduce

I have only recently started using NEAT, so I set up a conda environment containing the dependencies. My conda environment details are in the attached text file: conda_list.txt

I have then done a git clone of the repository and run the setup.py script.

I have then used the genSeqErrorModel.py script to generate an error model for the types of reads I would like to simulate using:

python genSeqErrorModel.py -i 150_reads1.fq -i2 150_reads2.fq -o errormodel_150

This then generates the file errormodel_150.pickle.gz

I have then used bedtools genomecov and then compute_gc.py to generate a GC coverage bias model:

bedtools genomecov -ibam 150_reads.bam -g ref.fa > 150_reads.bed
python compute_gc.py -r ref.fa -i 150_reads.bed -o gc_cov

This generates the file gc_cov.pickle.gz

From here I have run the following command:

python gen_reads.py -R 150 --vcf -p 1 -e errormodel_150.pickle.gz --gc-model gc_cov.pickle.gz -r ref.fa -o sim_data150

and receive the following output:

found index ref.fa.fai
reading NC_012920.1... 0.008 (sec)

sampling reads... [Traceback (most recent call last):
  File "gen_reads.py", line 892, in <module>
    main()
  File "gen_reads.py", line 615, in main
    all_inserted_variants = sequences.random_mutations()
  File "/nfs/anaconda3/envs/read_sim/lib/python3.8/site-packages/NEAT-3.0-py3.8.egg/source/SequenceContainer.py", line 600, in random_mutations
    temp = MutableSeq(self.sequences[i])
  File "/nfs/anaconda3/envs/read_sim/lib/python3.8/site-packages/Bio/Seq.py", line 1662, in __init__
    raise TypeError(
TypeError: The sequence data given to a MutableSeq object should be a string or an array (not a Seq object etc)

Expected behavior

I expected gen_reads.py to output 150 bp reads in FASTQ format, together with a VCF file, with GC content and an error profile similar to those of the real sequence data I have.

Additional context

I am trying to get this working on a remote server running Ubuntu 20.04.3.

joshfactorial commented 2 years ago

I will take a look at this. My suspicion is that there is a problem in genSeqErrorModel.

PratyushTandale commented 2 years ago

Hello, I have been facing the same MutableSeq issue. I used the following command:

/usr/local/biotools/python/3.8.1/bin/python ./NEAT/gen_reads.py -r hs37d5.fa -tr target.bed -R 151 -c 45 --force-coverage -E 0.002 -M 0 -v ins.vcf --pe 255 84 -p 2 --bam --vcf -o fastq_files/neat-25-125-ins-NGS118

This command runs on the old version of NEAT.

joshfactorial commented 2 years ago

This issue seems slightly different, because you aren't using the sequencing error model. I'll let you know if this should be in another ticket so we can track it separately once I have a chance to dive into this error.

joshfactorial commented 2 years ago

So far, the only way I've been able to reproduce this error is with Biopython 1.78. Your environment file says you have Biopython 1.79, but I would double-check that this is the case for the specific environment running the script. Biopython made a substantial change to how MutableSeq objects are handled between 1.78 and 1.79. Add these lines to the top of main to check the version:

import Bio
print(Bio.__version__)

PratyushTandale commented 2 years ago

It is using version 1.76.

joshfactorial commented 2 years ago

Try updating to 1.79 and let me know if that doesn't solve the problem.
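For anyone hitting the same error, the version check discussed above could be turned into a small guard that runs before a simulation starts. The following is only a sketch, not part of NEAT: the helper names `parse_version`, `meets_minimum`, and `check_biopython` are hypothetical, and the 1.79 threshold is taken from the suggestion in this thread rather than from any documented NEAT requirement.

```python
from importlib import metadata


def parse_version(text):
    """Turn a version string such as '1.79' into a tuple of ints, e.g. (1, 79)."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            # Stop at the first non-numeric component (e.g. 'dev0').
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum="1.79"):
    """Return True if the installed version is at least the minimum.

    The default of 1.79 is an assumption based on this thread.
    """
    return parse_version(installed) >= parse_version(minimum)


def check_biopython(minimum="1.79"):
    """Describe whether the Biopython in the *current* environment is recent enough."""
    try:
        installed = metadata.version("biopython")
    except metadata.PackageNotFoundError:
        return f"Biopython is not installed in this environment (need >= {minimum})"
    if meets_minimum(installed, minimum):
        return f"Biopython {installed} is recent enough"
    return f"Biopython {installed} is older than {minimum}; consider upgrading"


if __name__ == "__main__":
    print(check_biopython())
```

Running this with the same interpreter that runs gen_reads.py (for example the full path to python used in the command above) reports on the environment that actually executes the script, which may differ from the one listed by conda.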