ksahlin / BESST_RNA

Scaffolding of genomic assemblies with RNA seq data

RuntimeError: dictionary changed size during iteration #2

Closed raimiredwan closed 9 years ago

raimiredwan commented 9 years ago

Hi,

I have very limited scripting knowledge. While I was running the BESST_RNA tool, it threw the error below:

Traceback (most recent call last):
  File "/export/home/nenas/Desktop/program/BESST_RNA/src/Main.py", line 247, in <module>
    options.mapquality)
  File "/export/home/nenas/Desktop/program/BESST_RNA/src/Main.py", line 87, in Main
    (Contigs, Scaffolds, F, param) = MS.Algorithm(G, Contigs, Scaffolds, F, Information, C_dict, param) # Make scaffolds, store the complex areas (consisting of contig/scaffold) in F, store the created scaffolds in Scaffolds, update Contigs
  File "/export/home/nenas/Desktop/program/BESST_RNA/src/MakeScaffolds.py", line 44, in Algorithm
    G, Contigs, Scaffolds = RemoveLoops(G, Scaffolds, Contigs, Information, F) #step4
  File "/export/home/nenas/Desktop/program/BESST_RNA/src/MakeScaffolds.py", line 161, in RemoveLoops
    for graph in graphs:
  File "/usr/local/lib/python2.7/dist-packages/networkx/algorithms/components/connected.py", line 92, in connected_component_subgraphs
    for c in connected_components(G):
  File "/usr/local/lib/python2.7/dist-packages/networkx/algorithms/components/connected.py", line 54, in connected_components
    for v in G:
RuntimeError: dictionary changed size during iteration

I ran it using this command line:

python /export/home/nenas/Desktop/program/BESST_RNA/src/Main.py 2 -c Scaffolds_pass3.fa -f accepted_hits_1_sort.bam accepted_hits_2_sort.bam -o Besst_RNA -e 3 3 -T 50000 50000 -k 1000 1000 -z 1000 1000 >Besst_RNA.out 2>Besst_RNA.err

I mapped my filtered RNA reads using tophat and bowtie2, which I bet produces a lot of multi-mapped reads because of alternative junctions, which tophat takes into account. Is that the reason for the error?

Which RNA-seq mapper would you suggest?

Any suggestions on how to go about this?

Thank you

ksahlin commented 9 years ago

Hi,

Thanks for reporting, and sorry for my late reply. I just pushed an attempt to fix this bug. Please let me know if it solves your problem, otherwise I'll have a look at it again.
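For readers who hit the same traceback, here is a minimal sketch of how this class of error can arise. It assumes the cause is that the graph is modified while a lazily evaluated generator of connected components is still being consumed; the thread does not say what the actual fix in BESST_RNA was. The dict-based graph and every function name below are hypothetical stand-ins, not BESST_RNA or networkx code.

```python
def connected_components(adj):
    """Lazily yield connected components of an undirected graph.

    adj is a dict mapping each node to the set of its neighbours.
    """
    seen = set()
    for v in adj:  # the dict is iterated lazily, between yields
        if v in seen:
            continue
        component = set()
        stack = [v]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adj[node] - component)
        seen |= component
        yield component


def remove_node(adj, node):
    """Delete a node and all edges that touch it."""
    for neighbour in adj.pop(node):
        adj[neighbour].discard(node)


def build_graph():
    """Toy graph: two 2-node components and one isolated node."""
    return {"a": {"b"}, "b": {"a"}, "c": {"d"}, "d": {"c"}, "e": set()}


def prune_lazily(adj):
    """Removes nodes while the generator is still running: raises RuntimeError."""
    for component in connected_components(adj):
        if len(component) == 2:
            for node in component:
                remove_node(adj, node)


def prune_safely(adj):
    """Materialises all components first, then modifies the graph."""
    for component in list(connected_components(adj)):
        if len(component) == 2:
            for node in component:
                remove_node(adj, node)


if __name__ == "__main__":
    try:
        prune_lazily(build_graph())
    except RuntimeError as err:
        print("lazy version failed:", err)
    graph = build_graph()
    prune_safely(graph)
    print("safe version left:", graph)
```

Wrapping the generator in list() before the loop is one common way to avoid the error, at the cost of holding all components in memory at once.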

osvaldoreisss commented 9 years ago

Hi, I have two issues:

First:

Traceback (most recent call last):
  File "../../softwares/BESST_RNA/src/Main.py", line 247, in <module>
    options.mapquality)
  File "../../softwares/BESST_RNA/src/Main.py", line 87, in Main
    (Contigs, Scaffolds, F, param) = MS.Algorithm(G, Contigs, Scaffolds, F, Information, C_dict, param) # Make scaffolds, store the complex areas (consisting of contig/scaffold) in F, store the created scaffolds in Scaffolds, update Contigs
  File "/data/osvaldo/projeto_marisa/softwares/BESST_RNA/src/MakeScaffolds.py", line 47, in Algorithm
    (Contigs, Scaffolds, F, param) = NewContigsScaffolds(G, Contigs, Scaffolds, F, Information, C_dict, dValuesTable, param) #step5
  File "/data/osvaldo/projeto_marisa/softwares/BESST_RNA/src/MakeScaffolds.py", line 209, in NewContigsScaffolds
    print 'Nr of new scaffolds created: ' + str(len(newscaffolds))
TypeError: object of type 'generator' has no len()

I commented out the print lines, but then I got this error:

Traceback (most recent call last):
  File "../../softwares/BESST_RNA/src/Main.py", line 247, in <module>
    options.mapquality)
  File "../../softwares/BESST_RNA/src/Main.py", line 87, in Main
    (Contigs, Scaffolds, F, param) = MS.Algorithm(G, Contigs, Scaffolds, F, Information, C_dict, param) # Make scaffolds, store the complex areas (consisting of contig/scaffold) in F, store the created scaffolds in Scaffolds, update Contigs
  File "/data/osvaldo/projeto_marisa/softwares/BESST_RNA/src/MakeScaffolds.py", line 47, in Algorithm
    (Contigs, Scaffolds, F, param) = NewContigsScaffolds(G, Contigs, Scaffolds, F, Information, C_dict, dValuesTable, param) #step5
  File "/data/osvaldo/projeto_marisa/softwares/BESST_RNA/src/MakeScaffolds.py", line 211, in NewContigsScaffolds
    for newscaffold in newscaffolds:
  File "/usr/lib/python2.7/site-packages/networkx-2.0.dev_20150608164308-py2.7.egg/networkx/algorithms/components/connected.py", line 109, in connected_component_subgraphs
    for c in connected_components(G):
  File "/usr/lib/python2.7/site-packages/networkx-2.0.dev_20150608164308-py2.7.egg/networkx/algorithms/components/connected.py", line 64, in connected_components
    for v in G:
RuntimeError: dictionary changed size during iteration

The command line is:

python ../../softwares/BESST_RNA/src/Main.py 1 -c ../montagem/scaffolds.fasta -f ../alinhamentos/alinhamentos.sorted.bam -o scaffold -e 3 -T 20000 -k 500 -d 1 -z 1000

I ran the alignments with bwa:

bwa mem -t 20 ../index/scaffolds.fasta reads1.fastq reads2.fastq > alinhamentos.bam

I saw that the previous poster had the same issue. Do you know what could be happening?

ksahlin commented 9 years ago

Hi,

Thanks for your report! I just pushed code that should fix the first bug (and possibly the second one as well). Please download the latest version here on git. Let me know if it solves your problem, otherwise I'll have a look at it again.

Best, Kristoffer
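The first error above (no len() on a generator) is typical when a library call that used to return a list starts returning a generator. The following is a minimal sketch of that situation and one possible workaround, not the change that was actually pushed; new_scaffolds and count_and_collect are hypothetical names used only for illustration.

```python
def new_scaffolds(components):
    """Hypothetical stand-in for a call that returns a generator, not a list."""
    for component in components:
        yield sorted(component)


def count_and_collect(components):
    """Materialise the generator once so it can be counted and then reused."""
    scaffolds = list(new_scaffolds(components))
    return len(scaffolds), scaffolds


if __name__ == "__main__":
    generator = new_scaffolds([{"b", "a"}, {"c"}])
    try:
        len(generator)
    except TypeError as err:
        print("len() on the generator failed:", err)

    count, scaffolds = count_and_collect([{"b", "a"}, {"c"}])
    print("Nr of new scaffolds created: " + str(count))
```

Counting with sum(1 for _ in generator) would also work, but it consumes the generator, so a later loop over it would see nothing. Converting to a list keeps both the count and the items.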

ksahlin commented 9 years ago

By the way, since you have now changed BESST_RNA locally (the commented lines), you might want to overwrite these changes when you pull the new version, see http://stackoverflow.com/a/8888015.

Alternatively, revert all changes to exactly the original state (including whitespace) before pulling.

osvaldoreisss commented 9 years ago

Hi,

Thank you for your quick reply. I ran the latest version and now get this error:

python ../../softwares/BESST_RNA/src/Main.py 1 -c ../montagem/scaffolds.fasta -f ../alinhamentos/alinhamentos.sorted.bam -o scaffold -e 3 -T 20000 -k 500 -d 1 -z 1000
../../softwares/BESST_RNA/src/Main.py:224: UserWarning: parameter -g (treating haplotypic regions) inactivated, parameters -a and -b will not have any effect if specified.
  warnings.warn('parameter -g (treating haplotypic regions) inactivated, parameters -a and -b will not have any effect if specified. ')
Starting scaffolding with library: ../alinhamentos/alinhamentos.sorted.bam
Parsing BAM file...
Computing parameters not set by user...

Mean of library set to: No mean calc since RNA reads
Standard deviation of library set to: No std calc since RNA reads
-T (library insert size threshold) set to: 20000
-k set to (Scaffolding with contigs larger than): 500
Number of links required to create an edge: 3
Read length set to: 99.5488375037
Relative weight of dominating link set to (default=3): 3

LG50: 9147 NG50: 1825
Initial contig assembly length: 104738606
Nr of contigs/scaffolds included in scaffolding: 25401
Total time elapsed: 0.384147882462
USEFUL READS (reads mapping to different contigs): 380691
Reads with too large insert size from "USEFUL READS" (filtered out): 91757
Number of duplicated reads indicated and removed: 30835
Mean coverage before filtering out extreme observations = 195.757730545
Std dev of coverage before filtering out extreme observations= 477.337760937
Quantile for repeat detector chosen to: 3.88445706788
Quantile for repeat detector chosen to: 3.88250556384
Quantile for repeat detector chosen to: 3.8783028409
Quantile for repeat detector chosen to: 3.87554197881
Quantile for repeat detector chosen to: 3.87274763561
Quantile for repeat detector chosen to: 3.87146619564
Quantile for repeat detector chosen to: 3.8706938998
Mean coverage after filtering = 123.658055592
Std coverage after filtering = 118.930367846
Length of longest contig in calc of coverage: 215056
Length of shortest contig in calc of coverage: 15693
Perform inference on scaffold graph...
Remove isolated nodes.
Remove edges from node if more than two edges
Remove isolated nodes.
Nr of new scaffolds created: 188
Writing out scaffolding results for step 1 ...
Traceback (most recent call last):
  File "../../softwares/BESST_RNA/src/Main.py", line 247, in <module>
    options.mapquality)
  File "../../softwares/BESST_RNA/src/Main.py", line 100, in Main
    Contigs_copy, F_copy = GO.WriteToF(F_copy, Contigs_copy, list_of_contigs)
  File "/data/osvaldo/projeto_marisa/softwares/BESST_RNA/src/GenerateOutput.py", line 34, in WriteToF
    del Contigs[cont_obj.name]
KeyError: 'NODE_10231_length_1695_cov_59.1884_ID_20461'

Do you have any idea what is going wrong? Could it be a problem with the alignments made with bwa?

Best, Osvaldo

ksahlin commented 9 years ago

Have you made sure that the contig names in the fasta file match the contig names in the bam file? A mismatch is the most common cause of this error. More specifically, the error often shows up when one or more contigs listed among the bam file references are not found in the contig/scaffold fasta file. The error log you sent includes the name of the particular scaffold, so I would start by searching for the scaffold "NODE_10231_length_1695_cov_59.1884_ID_20461" and making sure it is present in the fasta input file.

Let me know if it gets sorted out. Best, Kristoffer
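The name check described above can be sketched in a few lines. This is an illustration under stated assumptions, not part of BESST_RNA: it assumes the BAM header has been dumped to text (for example with samtools view -H) and that a sequence name is the first whitespace-delimited token on a FASTA header line. The function names and the toy contig names are hypothetical.

```python
def fasta_names(lines):
    """Sequence names from FASTA lines: first token after '>' on header lines."""
    names = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.add(fields[0])
    return names


def sam_header_names(lines):
    """Reference names taken from the SN: field of @SQ header lines."""
    names = set()
    for line in lines:
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    names.add(field[3:])
    return names


def missing_from_fasta(fasta_lines, header_lines):
    """References named in the alignment header but absent from the FASTA."""
    return sorted(sam_header_names(header_lines) - fasta_names(fasta_lines))


if __name__ == "__main__":
    # Toy data; real input would be read from the FASTA file and the header dump.
    fasta = [">contig_1 some description\n", "ACGT\n"]
    header = [
        "@HD\tVN:1.0\tSO:coordinate\n",
        "@SQ\tSN:contig_1\tLN:4\n",
        "@SQ\tSN:contig_2\tLN:8\n",
    ]
    print(missing_from_fasta(fasta, header))
```

An empty result means every reference in the header has a matching FASTA record, in which case the cause of the KeyError lies elsewhere.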

osvaldoreisss commented 9 years ago

Hi,

This contig is present in my input fasta file:

>NODE_10231_length_1695_cov_59.1884_ID_20461
ATAATATTCCCACTATAAGTAGTTAAAGAAAGATACTTACTACTAACTTAGCTAATCTAT
ATATCTATAGTCTAAGGTGAGAAACTTCTATTCTAACCTTCAGTCTCGCTAGTTCTCTAT
TTCTCTATATTAAAGTCTCTCTAATAGTAAGAAGACTACTAAAACTATAATTAGTAGTTA...

Could there be a problem with the ID format of the contig?

ksahlin commented 9 years ago

ok. I have committed a new version! Please have a go again :)

osvaldoreisss commented 9 years ago

Hi,

Thanks, now it worked.

But I didn't see a significant improvement in the assembly. This is my initial assembly with SPAdes:

FILE scaffolds.fasta (contigs >= 0 bp)

Clusters: 72271

Total length: 104738606
Total length w/o "N"s: 104733780
Mean cluster size: 1449.24805246918
N50: 9147 (1825 contigs)

and this is the final assembly after running BESST:

FILE Scaffolds-pass1.fa (contigs >= 0 bp)

Clusters: 72047

Total length: 104761006
Total length w/o "N"s: 104756180
Mean cluster size: 1454.06479103918
N50: 9361 (1715 contigs)

The initial DNA dataset is Illumina single-end 100 bp and the RNA-Seq data are Illumina paired-end 100 bp. As I mentioned before, I used bwa-mem with default parameters to align the RNA-Seq reads to the assembly. Have you seen better results with other aligners?

ksahlin commented 9 years ago

Hi, great that it worked!

My guess is that scaffolding with RNA-seq does not improve (genome) assembly stats a lot, simply because it only scaffolds the gene space, which is usually not a big fraction of the genome. The number of genes covered by a single scaffold might, however, be improved. Do you have any method for evaluating this, like mapping core genes (CEGMA)?

To significantly improve the genome-wide contiguity, I would say you need a (genome) mate pair library.

Best, Kristoffer
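For reference, the N50 figures quoted in this thread can be computed as follows. This is a minimal sketch using the usual definition (the length of the sequence at which the running total of lengths, sorted from longest to shortest, first reaches half of the assembly length). The function name n50_l50 and the toy lengths are hypothetical and not taken from the thread.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a collection of sequence lengths.

    N50: length of the sequence at which the cumulative sum of lengths,
         sorted from longest to shortest, first reaches half the total.
    L50: how many sequences were needed to reach that point.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2.0
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no sequence lengths given")


if __name__ == "__main__":
    before = [10, 8, 6, 4, 2]
    joined_short = [10, 8, 6, 4 + 2]   # the two shortest sequences joined
    joined_long = [8 + 6, 10, 4, 2]    # two longer sequences joined
    print(n50_l50(before))
    print(n50_l50(joined_short))
    print(n50_l50(joined_long))
```

In the toy data, joining the two shortest sequences leaves N50 unchanged, while joining two longer ones raises it. A small number of joins can therefore leave summary statistics almost untouched even when individual regions have been improved.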

osvaldoreisss commented 9 years ago

Hi,

It makes sense. I'll take a look at CEGMA. Thank you for your help.

Best, Osvaldo