ncbi / sra-tools

SRA Tools

sam-dump produces invalid SAM #23

Closed DaGaMs closed 8 years ago

DaGaMs commented 8 years ago

I'm not able to ascertain whether this is a problem of the raw data or of sam-dump, but when I do e.g. this:

sam-dump -r SRR2141560 | samtools view -b - > SRR2141560.bam

I get errors of the type:

[E::sam_parse1] SEQ and QUAL are of different length
[W::sam_read1] parse error at line 187
[main_samview] truncated file.

The offending line looks like this:

35      163     1       17649   0       41S110M =       17898   400     GAGCACAAGCACTACTTACTGGCCTAGGTTGTGAGAGAAGTTGATGCTGCTGGGAAGACCCCCAAGTCCCTCTTCTGCATCGTCCTCGGGCTCCGGCTTGGTGCTCACGCACACAGGAAAGTCCTTCAGCTTCTCCTGAGAGGGCCAGGAT   ::;=@@B@CBCACBACABBDC@CAB7DAAADBD@DCDCACCBDCDDCD        RG:Z:4638-MDA-03_CGGCTATG_H0BLEALXX_L004        NH:i:2  NM:i:1
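
For anyone who wants to locate such records before converting, a quick filter over the dump works (a rough sketch; it skips records whose SEQ or QUAL is stored as "*"):

sam-dump -r SRR2141560 | awk -F'\t' '!/^@/ && $10 != "*" && $11 != "*" && length($10) != length($11) {print NR": "$1}' | head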

I've tried with sra tools versions 2.3.5 and 2.5.7, with the same result. Any idea what's going on there?

osris commented 8 years ago

It's an issue with the raw data: an inconsistency in the BAM flags where one read of a pair has no primary alignment, either because it is in the BAM but never marked as primary, or because it is not in the BAM at all. This produces an internal inconsistency in the SRA archive that only surfaces when you try to read the data. It's subtle enough that we have not been catching it when we load the files, and it is not even clear what we should do when we catch it. I've looked at the offending SAM records from the source files and many are questionable, bordering on nonsense. We are working on a fix and will either reprocess the source files or provide an updated sam-dump utility.
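
As an illustration only (not the archive's actual validation), one quick consistency check on a source BAM is to compare the counts of primary first-mate and second-mate records, which should be equal in a well-formed paired-end file (input.bam below is a placeholder for a submitted file):

samtools view -c -F 2304 -f 64 input.bam     # primary read-1 records (2304 excludes secondary 0x100 and supplementary 0x800)
samtools view -c -F 2304 -f 128 input.bam    # primary read-2 records; in a consistent BAM these two counts match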

DaGaMs commented 8 years ago

Interesting - I wonder what happened there. I'm now trying to dump to fastq and re-map. But it would be good to make this part of the QC in the future...

osris commented 8 years ago

I conflated two issues. I think this one is caused by an aligner that hard-clips the reads, but not the qualities, in some SAM rows (or possibly the qualities but not the reads). Either way, the data is corrupt, as the bases cannot be matched to their corresponding qualities. When SRA processes BAM files, we have been using the SAM flags to determine the primary read, and multiple placements of a read with conflicting flag values can lead to processing errors. Regardless, we need to add further checks to our loading procedures. SAM/BAM is the best hope for a defined file format for NGS data, but the spec is too loose, and samtools doesn't help the situation, as it doesn't do any validation beyond simple formatting checks. Any application that has actually implemented the SAM spec (SRA, Picard, GATK) would complain about these BAM files; those of us following the spec are forced to play catch-up.
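
As a sketch of the kind of extra check being described (not the actual loader logic; submitted.bam is a placeholder for a source file): the SAM spec requires the SEQ length to equal the sum of the query-consuming CIGAR operations (M, I, S, =, X), so clipping the bases without updating the CIGAR, or vice versa, shows up as a mismatch.

samtools view submitted.bam | awk -F'\t' '$6 != "*" && $10 != "*" {
    cig = $6; qlen = 0
    # walk the CIGAR one <length><op> token at a time
    while (match(cig, /[0-9]+[MIDNSHP=X]/)) {
        op = substr(cig, RSTART + RLENGTH - 1, 1)
        if (op ~ /[MIS=X]/) qlen += substr(cig, RSTART, RLENGTH - 1) + 0
        cig = substr(cig, RSTART + RLENGTH)
    }
    if (qlen != length($10)) print "CIGAR/SEQ length mismatch:", $1
}' | head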

DaGaMs commented 8 years ago

So, the data is inherently corrupt and fastq-dump will produce invalid fastq files, or missing reads?

kwrodarmer commented 8 years ago

fastq-dump will produce what was loaded, with reads and qualities. The length of each will be as submitted; reads that were filtered out during load will be missing.

DaGaMs commented 8 years ago

Further to this, I've now tried to dump these files into fastq, but fastq-dump is sucking up huge amounts of memory. For example,

fastq-dump --gzip --split-3 -G -T SRR2141573

fails even with 70GB of memory allocated to it. I'm now re-running it with 120GB of memory, but I'm not too hopeful even that will be enough. The same happens for all runs in that project. Any idea what the problem is there?

kwrodarmer commented 8 years ago

What is the failure?

DaGaMs commented 8 years ago

Either our cluster scheduler kills it because it uses > 70GB memory, or it dies by itself with

2016-01-14T08:41:17 fastq-dump.2.5.7 err: memory exhausted while resizing buffer within runtime module - failed /users/bschuster/ncbi/public/sra/SRR2141561.sra

=============================================================
An error occurred during processing.
A report was generated into the file '/users/bschuster/ncbi_error_report.xml'.
If the problem persists, you may consider sending the file
to 'sra@ncbi.nlm.nih.gov' for assistance.
=============================================================

You can see an example error report here: https://gist.github.com/DaGaMs/74338cc5e2af7e3c52c3

kwrodarmer commented 8 years ago

Okay, this is good. Yes, your cluster imposes a resource limit, which causes us to receive an almost unheard-of out-of-memory error during processing. I would drop the gzipping and splitting of the output and try it without.

The run you mention has 206,754,609,912 bases, which is almost twice the size of SRR1264615 that you mention in another issue. The output of a simple, single-file fastq-dump of this run will produce about a half terabyte file.
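
(As a rough back-of-envelope check of that figure, assuming about two printed bytes per base for the sequence and quality lines and ignoring read names and "+" separator lines:)

echo "206754609912 * 2" | bc    # ~413 GB of bases and qualities; read names and separators push it toward half a terabyte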

DaGaMs commented 8 years ago

I'll try without the splitting/zipping, but I'll need to split by read group and mate pair eventually to map the data correctly.

For clarification: SRR1264615 belongs to project SRP041470. All runs in SRP041470 extract ok and have no SAM issues. On the other hand, all runs in SRP061939 have SAM issues as described above, and all of them lead to humongous memory consumption upon fastq-dump.

kwrodarmer commented 8 years ago

Another thing to point out is that this run, in addition to being quite large (52G as SRA, but ~500G as fastq, which doesn't even account for its alignments), contains 83 references. In the release you have, our code keeps all 83 of those references in memory. Release 2.6.0 will reduce this considerably. We'll test to see whether this could account for some of the memory consumption.

Regarding the original topic of this issue, we have discovered a number of cases where the submitter has been hard-clipping secondary alignments. Hard-clipping is invalid for submission to archives, for the obvious reason that it discards sequence. The specific case is that secondary alignments occurring before their related primaries are seen as clipped, and their partial sequence is what gets extracted during load. When the primary is seen later, the full sequence is available, but by then it is too late. We are addressing this in our ETL.

DaGaMs commented 8 years ago

It seems unlikely to me that the reference issue is responsible for the memory inflation. First of all, I think the runs from SRP041470 and SRP061939 were aligned to the same reference, but only the latter causes memory issues. Secondly, the combined uncompressed size of all the reference sequences is 1.4G. Even if everything was loaded into memory, at a 2x overhead per base, it would still use up less than 3G of memory.

kwrodarmer commented 8 years ago

Yes, and this is why we have traditionally cached them.

DaGaMs commented 8 years ago

So the output of fastq-dump $HOME/ncbi/public/sra/SRR2141580 puts mate pairs concatenated on the same line? In that case, I should be able to split them by length, hopefully?

However, read-group annotation is completely lost. I won't be able to re-capture the sequencing run for each read, right? That's a bit sub-optimal...

kwrodarmer commented 8 years ago

Well, it's a different use case. For example, 454 mate pairs will contain linkers, and for anyone who wants to perform their own read splitting, that's the way to go. Illumina read-pairs are fixed size, so they can be split externally as well.

vdb-dump will show the splitting information that we use within fastq-dump to create split files. I think we may even have a perl script that does this for you... I'll investigate and post later.
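
For anyone going that route, something along these lines shows the per-spot layout that the splitting is based on (a sketch; the exact columns available can vary by run, and SPOT_GROUP is where read-group information is kept):

vdb-dump -R1 -C READ_START,READ_LEN,READ_TYPE,SPOT_GROUP SRR2141580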

So if you want them to be split within fastq-dump, go ahead - but you may do better by not compressing the output. You can continue to experiment or wait for us to examine this ourselves, the latter of which will take some time.

kwrodarmer commented 8 years ago

You'll notice that the spot length for SRR2141580 is 302, and in fact

$ vdb-dump -R1 -CREAD_LEN SRR2141580

shows READ_LEN: 151, 151. It is not difficult to split this using text processing tools.
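
For example, with the fixed 151+151 layout here, a short awk pass can split the unsplit dump into two mate files (a sketch only; it assumes every spot really is 151+151, that the unsplit output is in SRR2141580.fastq, and it writes the read names unchanged to both output files):

awk '
    NR % 4 == 2 || NR % 4 == 0 {
        # sequence and quality lines: first 151 characters are mate 1, the rest mate 2
        print substr($0, 1, 151) > "SRR2141580_1.fastq"
        print substr($0, 152)    > "SRR2141580_2.fastq"
        next
    }
    {
        # "@" header and "+" separator lines are copied to both files
        print > "SRR2141580_1.fastq"
        print > "SRR2141580_2.fastq"
    }' SRR2141580.fastq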

Of course, we also provide the ability to split the reads, and that's what we normally recommend. We're just exploring other avenues here, and showing that in fact the output is perfectly useful.

DaGaMs commented 8 years ago

Sorry to keep going on about this, but the experiment unfortunately failed. I tried running fastq-dump $HOME/ncbi/public/sra/SRR2141580 without splitting or compressing, but fastq-dump still started using huge amounts of memory. I gave it 15G, which it used up within 20 minutes before dying. So, whatever the problem is, it's not related to splitting or compressing.

kwrodarmer commented 8 years ago

Okay, then let us work on it for a bit and we'll be back with you.

However, I'm confused - you gave it 15G of running space? I thought before you were talking in terms of 70G and beyond?

DaGaMs commented 8 years ago

So, a few of the runs finished with a max memory usage of 105G, but most of them still failed after they exceeded 120G of memory... I'll try again with 256G each...

osris commented 8 years ago

We are producing cache files that will speed up the process and reduce memory use. I'll check on the progress in a bit.


kwrodarmer commented 8 years ago

Sorry that this issue didn't get updated before - it went into our internal NCBI issue tracking DB and didn't get updated here. But in general, the files that were giving these problems have been reloaded. Please let us know if this is still a problem for you.

kwrodarmer commented 8 years ago

When running against the reloaded data, there does not appear to be any further error:

sam-dump -r SRR2141560 | samtools view -b -S - > SRR2141560.bam

This works.

hammer commented 8 years ago

@osris @kwrodarmer i'm seeing a nearly identical error on SRR2170057:

[E::sam_parse1] SEQ and QUAL are of different length
[W::sam_read1] parse error at line 76842889
[main_samview] truncated file

osris commented 8 years ago

This one had not been reloaded; it was only released a couple of weeks ago and must have been missed in our first scan. Reloading it now and checking for others in this study.

hammer commented 8 years ago

@osris thanks for looking into it! we're having problems converting several of the files in this collection to BAMs, including SRR2177298, SRR2177289, SRR3198441, SRR2177293. Some of these die with a segmentation fault in sam-dump, others die in samtools view.

hammer commented 8 years ago

A few more that look busted: SRR3198427, SRR2177269

osris commented 8 years ago

@hammer No need to send more examples. We are checking to find out why we missed these when we scanned for inconsistent SEQ/QUAL lengths (BWA-MEM plays fast and loose with the SAM spec). I'm running a test on the reloaded SRR2170057 right now and will queue the others after verifying the fix.

hammer commented 8 years ago

@osris how's the cleanup of these files coming? I have another set of IDs of broken files to send you if you're interested!

hammer commented 8 years ago

@osris Any updates? We'd really like to be able to use this data.

osris commented 8 years ago

@hammer Apologies, this dropped off my radar. I'm running the reloads now and will check progress over the weekend.

osris commented 8 years ago

@hammer reprocessing is complete. please let me know if you encounter any issues.