ibest / ARC

Assembly by Reduced Complexity (ARC)
Apache License 2.0

enhancement request to deal with unusual read names in SRA datasets (causes KeyError termination) #47

Closed (jayoung closed this issue 8 years ago)

jayoung commented 9 years ago

Hi there,

Thanks for writing and making ARC available - I am beginning to use it and finding it VERY useful. I have an enhancement suggestion I'd love you to consider, if it's easy to implement. If it's not easy, not to worry!

I've been using ARC on some datasets I downloaded from NCBI's SRA using their SRA toolkit. Single-end data was working well in ARC, but I was having some trouble with paired-end data until I realized that read naming in those SRA files is not like standard Illumina format. I cooked up a script that fixes the read names to be suitable for ARC, but it'd be nice to avoid that step if possible.

When I download paired-end reads from SRA, pairs of reads get named like this:

SRR505874.1.1 and SRR505874.1.2 (first pair in the set)
SRR505874.51.1 and SRR505874.51.2 (51st pair in the set)
etc.

ARC doesn't seem to like the .1 and .2 extensions (understandable), but if I fix the read names so that both members of the pair have the same name (i.e. SRR505874.1 for the first pair, and SRR505874.51 for the 51st pair) then ARC works fine. It'd be nice to be able to specify some option to ARC so that it strips off the .1/.2 extensions from the names itself. I've pasted at the bottom the error ARC gives when I leave the .1/.2 extensions on.
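In case it's useful, the fix-up step I mean amounts to something like the rough sketch below (along the lines of what I did, assuming plain four-line FASTQ records; this is illustrative, not my exact script):

    #!/usr/bin/env python
    """Strip the trailing .1/.2 pair tag from SRA-style FASTQ read names,
    e.g. @SRR505874.1.1 -> @SRR505874.1, so both mates share one name."""
    import sys

    def fix_names(in_path, out_path):
        with open(in_path) as fin, open(out_path, "w") as fout:
            for i, line in enumerate(fin):
                if i % 4 == 0 and line.startswith("@"):
                    # Header line: strip the final ".1"/".2" from the read ID,
                    # leaving any description after the first space untouched.
                    name, _, rest = line[1:].rstrip("\n").partition(" ")
                    name = name.rsplit(".", 1)[0]
                    fout.write("@" + name + ((" " + rest) if rest else "") + "\n")
                else:
                    fout.write(line)

    if __name__ == "__main__":
        fix_names(sys.argv[1], sys.argv[2])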

If you want some test datasets, please let me know and I should be able to cook something up.

all the best,

Janet Young


Dr. Janet Young

Malik lab http://research.fhcrc.org/malik/en.html

Fred Hutchinson Cancer Research Center 1100 Fairview Avenue N., A2-025, P.O. Box 19024, Seattle, WA 98109-1024, USA.

tel: (206) 667 4512 email: jayoung ...at... fhcrc.org



ARC Version: v1.1.3 2014-09-02
[2015-04-20 18:32:19,484 INFO 20261] Reading config file...
[2015-04-20 18:32:19,608 INFO 20261] max_incorporation not specified in ARC_config.txt, defaulting to 10
[2015-04-20 18:32:19,608 INFO 20261] workingdirectory not specified in ARC_config.txt, defaulting to ./
[2015-04-20 18:32:19,608 INFO 20261] fastmap not specified in ARC_config.txt, defaulting to False
[2015-04-20 18:32:19,608 INFO 20261] keepassemblies not specified in ARC_config.txt, defaulting to False
[2015-04-20 18:32:20,111 INFO 20261] Setting up working directories and building indexes...
/home/jayoung/testRNAseqReads/FASTQfiles/tempUnpack/combined_1.fastq
/home/jayoung/testRNAseqReads/FASTQfiles/tempUnpack/combined_2.fastq
[2015-04-20 20:21:46,540 INFO 20261] Sample: T_malaccensis, indexed reads in 6566.36648893 seconds.
[2015-04-20 20:21:48,152 INFO 20261] allocating a new mmap of length 4096
[2015-04-20 20:21:48,153 INFO 20261] Running ARC.
[2015-04-20 20:21:48,153 INFO 20261] Submitting initial mapping runs.
[2015-04-20 20:21:48,153 INFO 20261] Starting...
[2015-04-20 20:21:48,156 INFO 12215] child process calling self.run()
[2015-04-20 20:21:48,156 INFO 12216] child process calling self.run()
[2015-04-20 20:21:48,158 INFO 12217] child process calling self.run()
[2015-04-20 20:21:48,158 INFO 12215] Sample: T_malaccensis Running bowtie2.
[2015-04-20 20:21:48,159 INFO 12218] child process calling self.run()
[2015-04-20 20:21:48,201 INFO 12215] Sample: T_malaccensis Calling bowtie2-build.
[2015-04-20 20:21:48,201 INFO 12215] bowtie2-build -f /home/jayoung/testARCgene1/firstTry/working_T_malaccensis/I000_contigs.fasta /home/jayoung/testARCgene1/firstTry/working_T_malaccensis/idx/idx
[2015-04-20 20:21:53,044 INFO 12215] Sample: T_malaccensis Calling bowtie2 mapper
[2015-04-20 20:21:53,044 INFO 12215] bowtie2 -I 0 -X 1500 --very-fast-local --mp 12 --rdg 12,6 --rfg 12,6 -p 4 -x /home/jayoung/testARCgene1/firstTry/working_T_malaccensis/idx/idx -k 3 -1 /home/jayoung/testRNAseqReads/FASTQfiles/tempUnpack/combined_1.fastq -2 /home/jayoung/testRNAseqReads/FASTQfiles/tempUnpack/combined_2.fastq -S /home/jayoung/testARCgene1/firstTry/working_T_malaccensis/mapping.sam
[2015-04-20 22:38:52,969 INFO 12215] Sample: T_malaccensis, Processed 126931493 lines from SAM in 6139.76893306 seconds.
[2015-04-20 22:38:54,263 INFO 12215] Sample: T_malaccensis Running splitreads.
[2015-04-20 22:52:42,095 ERROR 12215] Traceback (most recent call last):
  File "/home/jayoung/malik_lab_shared/lib/python2.7/site-packages/ARC/process_runner.py", line 62, in run
    self.launch()
  File "/home/jayoung/malik_lab_shared/lib/python2.7/site-packages/ARC/process_runner.py", line 43, in launch
    job.runner()
  File "/home/jayoung/malik_lab_shared/lib/python2.7/site-packages/ARC/runners/base.py", line 58, in runner
    self.start()
  File "/home/jayoung/malik_lab_shared/lib/python2.7/site-packages/ARC/runners/mapper.py", line 55, in start
    self.splitreads()
  File "/home/jayoung/malik_lab_shared/lib/python2.7/site-packages/ARC/runners/mapper.py", line 392, in splitreads
    read2 = idx_PE2[readID]
  File "/home/jayoung/malik_lab_shared/lib/python2.7/site-packages/biopython-1.60-py2.7-linux-x86_64.egg/Bio/SeqIO/_index.py", line 423, in __getitem__
    if not row: raise KeyError
KeyError
[2015-04-20 22:52:42,095 ERROR 12215] An unhandled exception occurred
[2015-04-20 22:52:42,095 ERROR 20261] Terminating processes
[2015-04-20 22:52:42,166 ERROR 20261] ARC.app unexpectedly terminated
[2015-04-20 22:52:42,167 INFO 20261] process shutting down

samhunter commented 9 years ago

Hi Janet,

Thanks for giving ARC a try and for your very detailed and informative description of the problem. I am really happy to hear that the software is working well for you.

The problem you are describing is due to the fact that ARC expects reads either to have "/0" or "/1" at the end, or to have the same name for each member of the pair (to support reads from more recent versions of CASAVA). As you noted, the ".1" and ".2" extensions to the names are not handled properly, so when ARC finds a read in the PE1 index and tries to look up the same name in the PE2 index, the lookup fails with the KeyError you saw. I think I could make a small modification that would take care of this problem. I will try to look into it within the next couple of days and update this issue when I do.
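Roughly, the lookup that fails looks like this (a simplified sketch for illustration, not the actual ARC code):

    from Bio import SeqIO

    # Sketch only: ARC builds dict-like indexes over the two FASTQ files.
    idx_PE1 = SeqIO.index("combined_1.fastq", "fastq")  # keys like SRR505874.1.1
    idx_PE2 = SeqIO.index("combined_2.fastq", "fastq")  # keys like SRR505874.1.2

    for readID in idx_PE1:       # each recruited read ID (simplified; ARC gets these from the SAM output)
        read2 = idx_PE2[readID]  # KeyError: "SRR505874.1.1" is not a key in the PE2 index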

In the long term, I plan to re-write the indexing strategy completely to improve speed and robustness, and will keep this use-case in mind when I do.

thanks again,

Sam

jayoung commented 9 years ago

Thanks for looking into this, Sam - I'll keep an eye out for this in future versions

Janet

samhunter commented 9 years ago

Hi Janet,

This is mostly untested, but I have added a new parameter and made some small code changes to support SRA-formatted reads.

To test this code:

git clone https://github.com/ibest/ARC.git
cd ARC
git checkout indexing

Then, in ARC_config.txt, set:

sra=True

This will cause ARC to trim off the last character of the ReadID instead of applying the default behavior, and should make it work with your reads.
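In other words, the effect on the read IDs is roughly this (illustration only, not the actual code):

    def pair_key(readID, sra=False):
        # Illustration only: with sra=True the trailing mate character is
        # trimmed, so "SRR505874.1.1" and "SRR505874.1.2" map to the same key.
        if sra:
            return readID[:-1]
        return readID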

I don't have any reads from the SRA sitting around, so although I tested this new version with my normally formatted reads, I haven't tested it at all with SRA reads. If you don't mind testing it and letting me know whether it works, that would be awesome! If not, I won't be disappointed, and I will try to test it myself as soon as I can.

thanks,

Sam

jayoung commented 9 years ago

Thanks so much, Sam. I will test this in the next day or two: I have a few longer ARC jobs still running so don't want to reinstall the new version until they're done. I'll let you know soon...

I've been digging a bit more into the tools I used to get reads from SRA (NCBI's prefetch and fastq-dump from the sra-toolkit). It turns out there are options that let the user control read naming better than I had understood - I can in fact ask for the original Illumina read names. So if I'd got that right in the first place, I would have been able to get read names suitable for ARC without modification. Sorry to have bothered you when I could have made an upstream fix.

Still, from my point of view it's actually very nice to have this new option in ARC to strip off the extensions. For other applications the .1 and .2 extensions seem to be important (tblastn on the reads seems to get screwy if the mate pairs have duplicate names), so that ARC option will let me reduce the number of file versions I keep around.

Given that people may have various formats of SRA files around (with and without the extensions), and so may or may not want to trim read names, I wonder if it would be clearer for naive users to name that option differently in the config file, to make it more obvious what it's doing: something like trimReadNames?

thanks again,

Janet

jayoung commented 9 years ago

Hi Sam,

Sorry it took me a long time to get back to this. I finally had a chance to test the sra=True option you implemented, and it seems to work nicely - thanks very much! (The version I tested was ARC Version: 1.1.2-Develop 2015-01-15; it took me a while between installing it and testing it.)

One slight word of caution: with the dataset I tested, I didn't get as nice an assembly as I had before with an older version of ARC (ARC Version: v1.1.3 2014-09-02), using the same ARC config file as before (except for the sra=True line and the lines that tell ARC where to find the reads). The new version stopped after 2 iterations, having recruited >7 million reads (and produced ~4 million contigs). The older version ran for 9 iterations, recruited ~5000 reads and made 8 contigs (three of which covered bits of my target).

I'm not too worried about this, but thought I'd mention it in the spirit of full disclosure. I'm not worried because this particular target/dataset combination didn't do all that well to start with (low coverage and a divergent target). I'm guessing that certain other parameters/defaults that affect repeat handling might have changed between the two versions I'm testing, and that the altered results are not related to the sra=True modification. In any case, I had to mess around with the parameters quite a bit to get the initial result, so maybe with the updated ARC I just need to re-tune them.

thanks again,

Janet

samhunter commented 9 years ago

Hi Janet,

No worries about taking a long time to respond, I have been pretty distracted from working on ARC for the last few months myself.

The results you are seeing are quite strange. Just to double-check, is there any possibility of a typo in your numbers (7 million reads recruited and 4 million contigs in 2 iterations vs 8 contigs and 5000 reads in 9 iterations from the same data with the same config)? This seems like a very large and worrying change in the output to me, but I can't think of any substantial changes to the read recruitment/assembly strategy that I have made recently which could explain these results.

I guess I'd better run some more tests before I merge the latest changes. Thanks for trying ARC!

Sam

jayoung commented 9 years ago

Hi Sam,

No typo - sorry! (I looked again). I am using the sloppymapping=True option, and even on the first iteration the dev version recruited quite a lot more reads than I got before (99 reads with v1.1.3 2014-09-02 and 597 reads with v1.1.2-Develop 2015-01-15). I'm using bowtie2 and have also updated that recently, so it's possible the change is coming from bowtie2 not from ARC. Perhaps I should reinstall the non-dev version of ARC and see how that behaves with the newer bowtie2: that would be a better test of what ARC is doing.

I've also started a run using the dev version on some other assemblies that worked better for me in the first place - I will let you know how I get on. I didn't choose a very good test case the first time around, I'm afraid (but it was the one for which I had reads from SRA). As I mentioned, that dataset didn't yield very good results to begin with: it's an RNA-seq dataset, with a target from a somewhat diverged organism. I think the gene is not expressed at high enough levels for there to be decent coverage of the gene in any case.

thanks again,

Janet

jayoung commented 9 years ago

Hi again,

Here's a second test comparing v1.1.3 (2014-09-02) with v1.1.2-Develop (2015-01-15). This test does not use SRA-style read names, so I'm merely comparing the performance of the two ARC versions rather than testing the new setting.

The task: I'm using some whole-genome data we generated from a previously unsequenced species of Caenorhabditis, and I want to assemble the mitochondrial genome (we don't get good assemblies of it from whole-genome assemblies, likely because its coverage is way higher than for the nuclear genome). I'm using the C. elegans mitochondrial sequence as the target (13.8 kb, ~87% identical to the new species).

With ARC v1.1.3 2014-09-02 I had spent some time messing around with ARC parameters and was very pleased that I got a complete assembly of the mitochondrial genome after 20 iterations (as well as a whole bunch of other contigs that didn't match the target very well - that's fine).

Using the same parameters with v1.1.2-Develop 2015-01-15, my assembly is OK but not as good:

(a) it is more fragmented than before - the output contigs seem to separately cover roughly three chunks of the target sequence, and

(b) for each chunk of the target I get quite a lot of matching contigs that are almost identical to each other - e.g. for the 3' half of my target I get ~15 matching contigs. For example, the "best" two contigs (meaning the best blast hits when using the target to query the contigs.fasta file that came out of ARC) are both 7270 bp long, with only 1 nucleotide difference between them - it's a little strange the assembler didn't merge them.

I might be able to fix some of that by messing with the parameters again to get them right for the new version, so please don't consider this a complaint or bug report. However, I thought I'd pass that along in case it gives you insight into differences between the dev branch and the release. Is this making any sense to you? If it's useful, I can give you more information on my run.

thanks,

Janet

jayoung commented 9 years ago

Hi again, Sam,

This is a little embarrassing - although there wasn't a typo in my summary of the output of the ARC run where I tested SRA-style read names (the 7 million versus 5000 reads), it turns out I had totally screwed up the input fastq files. I'd previously deleted those big files from our server, and when I downloaded them again to test ARC I used the wrong set of accession numbers (and foolishly generated fastq files with the same names). I've now used the correct input data, and that run using v1.1.2-Develop 2015-01-15 very nicely recapitulates the results I got with v1.1.3 2014-09-02.

I think the slightly different results on my mitochondrial assemblies still stand, but I'll double-check that too.

I'm sorry for the confusion: I think the multitasking on too many projects is messing with my mind...

thanks,

Janet

samhunter commented 9 years ago

Hi Janet,

Thanks so much for following up on this. I have certainly made similar mistakes with input files etc. (in fact, one just on Thursday, trying to match up sample names from two different platforms!), so please don't worry about it.

I have been mulling over changes that might explain the difference in your results. As far as I can tell from memory and my check-ins, the only things I've changed which could impact ARC's read recruitment are:

1) https://github.com/ibest/ARC/issues/42 -- which shuts off read correction in SPAdes when "only-assembler=True" is set. I found that this did make a very small difference in the final number of reads recruited, but the assemblies were effectively the same.

2) https://github.com/ibest/ARC/issues/44 -- which causes bowtie2 to output only the aligned reads using the --no-unal switch, saving disk space and I/O with large numbers of reads. I didn't find any difference in the number of reads recruited in my test dataset for this change.

It is certainly possible that I've unintentionally done something else which altered read recruitment, so if you do end up seeing a difference with your mito assemblies, please do let me know.

thanks again,

Sam

samhunter commented 8 years ago

Included in v1.1.4-beta release.