rajewsky-lab / mirdeep2

Discovering known and novel miRNAs from small RNA sequencing data
GNU General Public License v3.0

bwa_sam_converter.pl split loop error #96

Closed smahaffey closed 1 year ago

smahaffey commented 2 years ago

I've been using a pipeline where we align our small RNA reads and feed that genome alignment into multiple miRNA prediction applications. Running bwa_sam_converter.pl on some of our samples results in a split loop error. It seems to occur when the comma-separated list of read IDs collected for a single sequence grows too long: at some point Perl seems to decide the split is looping infinitely and aborts with this error. I have a version that stores an array in the hash instead of the comma-separated string, which avoids splitting a very long string. I can submit a pull request with the changes if you'd like.

Drmirdeep commented 2 years ago

It is unclear what the cause of the error is. It sounds as though your IDs are too long, whatever "long" means here. If that is the case and Perl crashes, it would be better to address the ID length than to change anything else first. In any case, please provide an example so the error can be reproduced.

smahaffey commented 2 years ago

Thank you for taking a look at this. Maybe the length of that string becomes a problem when it grows excessively long. I didn't do anything to the IDs; they come directly from the FASTQ files provided by the sequencing core, and I would prefer not to have to change the IDs on terabytes of data. I will upload a BAM file to our website and post a link soon so that you can download it, convert it to a SAM file, and replicate the problem, if that's alright? As a BAM it will probably be 3-9 GB. I'm currently going back to find files that reproduce the error. I don't get the error with the changes I made, but I know which batches of sequencing data started to have problems, so it might take me another day to find some examples.

I made a very simple modification so the IDs don't have to be split out of a long string, which eliminated the error while providing an equivalent array of IDs in the end. See below.

Current method:

sub sam_reads_collapse{
        ...
        while(<IN>){
                next if(/^@/);
                @line=split();
                ...
                # append each read ID to a comma-separated string keyed by sequence
                if(not $seq{$line[9]}){
                        $seq{$line[9]}="$line[0]";
                }else{
                        $seq{$line[9]}.=",$line[0]";
                }
        }
        ...

        for my $k (keys %seq){
                # ERROR: the split below triggers a split loop error on some of
                # our files once millions of IDs have accumulated in one string
                @ids=split(",",$seq{$k}); # <-- ERROR
                ...
        }
        ...
}

Modified:

sub sam_reads_collapse{
        ...
        while(<IN>){
                next if(/^@/);
                @line=split();
                ...
                # store the read IDs in an array ref keyed by sequence instead
                # of one long string; push autovivifies the array on first use
                push @{ $seq{$line[9]} }, $line[0];
        }
        ...

        for my $k (keys %seq){
                @ids = @{ $seq{$k} };   # no split needed
                ...
        }
        ...
}
mschilli87 commented 2 years ago

Thank you for providing more information. When sharing input data, please try to keep it to the minimum necessary to reproduce the problem. So if this can be triggered by a single BAM record with a long name, please don't make us download and process 3 GB of additional data. :wink:

smahaffey commented 2 years ago

Thank you both very much. Unfortunately it's caused by a very large number of reads having the same sequence, so I've filtered it down to a SAM/BAM containing only the reads with that exact sequence. It is still a large file: the original BAM for this sample is 9.9 GB compressed, while the filtered example is a 1.2 GB BAM, plus a 13.6 GB SAM in case it's easier for you not to convert it, though it's much larger to download.

Our small RNA data actually covers everything under 200 bp, not just reads selected for miRNA. I BLASTed the first problem sequence I found, which is the sequence in this example file, and it overlaps a snoRNA. Yes, this is a very excessive number of reads sharing one sequence, but it is problematic to filter all of these out because we don't just have a few samples; we have hundreds of samples across multiple strains and tissues that I'm running this pipeline on. The modified script seems to generate the .arf and .fa files appropriately for input into miRDeep2. However, I assume that filtering these reads out would speed up the miRDeep2 runs.

Bam https://phenogen.org/downloads/example.bam

Sam https://phenogen.org/downloads/example.sam

Command used to get the error: perl bwa_sam_converter.pl -c -i example.sam -o reads.fa -a alignment.arf

mschilli87 commented 2 years ago

If you don't mind me asking, what are you trying to learn from mapping 'a very large number of reads having the same sequence' to the genome? The standard miRDeep2 workflow runs the collapse_reads_md.pl script on the input FASTA. This not only saves valuable (computation) time when mapping, but also during quantification. It also reduces the disk space required to store the (intermediate) results, and would probably solve your problem for free as a bonus. :wink:
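To illustrate what collapsing does conceptually (this is a minimal sketch, not the actual collapse_reads_md.pl implementation, and collapse_reads is a hypothetical helper): identical sequences are merged into a single FASTA record whose ID encodes how often the sequence occurred, in the seq_&lt;n&gt;_x&lt;count&gt; style seen later in this thread.

```perl
#!/usr/bin/env perl
# Sketch only: merge identical read sequences into one FASTA record each,
# encoding the occurrence count in the ID (e.g. >seq_1_x59739030).
use strict;
use warnings;

sub collapse_reads {
    my ($reads, $prefix) = @_;    # $reads: array ref of raw sequences
    my %count;
    $count{$_}++ for @$reads;     # tally identical sequences
    my (@fasta, $i);
    # most abundant sequence first; break ties alphabetically
    for my $seq (sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count) {
        push @fasta, sprintf(">%s_%d_x%d\n%s", $prefix, ++$i, $count{$seq}, $seq);
    }
    return \@fasta;
}

# two copies of ACGT collapse into a single record with count 2
print "$_\n" for @{ collapse_reads([qw(ACGT ACGT TTTT)], 'seq') };
```

Mapping the collapsed FASTA then asks the aligner about each distinct sequence exactly once, while the _x&lt;count&gt; suffix preserves the original read depth.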

mschilli87 commented 2 years ago

Also, I was not able to reproduce the error when running the following commands:

wget https://phenogen.org/downloads/example.bam
samtools view -H example.bam > example.sam
samtools view example.bam >> example.sam
bwa_sam_converter.pl -c -i example.sam -o reads.fa -a alignment.arf

Instead, I got valid reads.fa and alignment.arf files containing a single (distinct) read with 30 (different) alignments:

cat reads.fa
>seq_1_x59739030
ATCGATGATGATTTCCAATAAAACATTCCTTGGAAAGCTGAACAAAATGAG
tee >(head -n2) < alignment.arf | wc -l
seq_1_x59739030 51      1       51      atcgatgatgatttccaataaaacattccttggaaagctgaacaaaatgag     chr1     51      110467219       110467269       atcgatgatgatttacaataaaacattccttggaaagctgaacaaaatgag      -       1       mmmmmmmmmmmmmmMmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
seq_1_x59739030 51      1       51      atcgatgatgatttccaataaaacattccttggaaagctgaacaaaatgag     chr1     51      110477372       110477422       atcaatgatgatttccaataaaacattccttggaaagctgaacaaaatgag      -       1       mmmMmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
30

Note that running bwa_sam_converter.pl on these data required more than 8 (but less than 16) GB of RAM.

smahaffey commented 2 years ago

Thank you so much for continuing to maintain this resource and taking the time to look at the problem. I assume this is a fairly unique problem, but without those changes we can't easily run the script. I'm sorry if this is somehow fixed automatically by a newer version of Perl. After recently running into this problem again with the current version of miRDeep2, and then looking back at an old version of the script that we used previously (~2015), I remembered modifying it then too. We have it working for us; I just thought I would pass along the changes. Yes, these files require a lot of RAM to process, and I appreciate that this was clearly considered when writing miRDeep2. Other software has not been as mindful of memory usage and will not run a sample even on our 384 GB machine.

It's odd that you don't get this error with the example I sent; I'm sorry to have wasted your time on it. The error occurs on a RHEL 7 machine with Perl 5.16.3. Maybe the split loop error has been fixed or improved in more recent versions, as I'm aware this is an older version of Perl now. This also occurred years ago with rn6 on our RHEL 6 server; I can't look up that Perl version.

I must be missing something: collapse_reads_md.pl would be run on the reads FASTA file after I've converted the SAM file to reads.fa and alignment.arf, which is where the problem occurs on some samples. I don't see how that helps prior to conversion. Also, doesn't the -c option collapse the reads in the output FASTA?

To answer your other question about why we are doing this: we aren't intentionally sequencing the same sequence, and there were ~800,000 other unique read sequences in the original file. For this sample, as requested, I filtered the file down to a minimal set that replicated the problem; the original 9.9 GB BAM contains many other sequences.

We are looking at reconstructing the transcriptome in a genetically diverse but reproducible population, the Hybrid Rat Diversity Panel. For the small RNA, we've sequenced most small RNAs below 200 bp, using high read depth (20-80 million reads per sample) to detect lower-abundance small RNAs. As the majority of reads hit miRNAs, we start by aligning to miRNAs. With each new genome version we also try to run miRNA prediction, as the rat is very under-annotated. In some samples an unfortunately large proportion of reads aligns to just a few sequences.

Initially the miRNA predictions are used to build a database of potential miRNAs (known and novel); we then collapse all the mature miRNAs with identical sequences, realign all the reads, and quantify each mature miRNA in each sample so we can look for differences in expression across differing genetic backgrounds. We do start filtering out outlier samples at this point, after alignment to mature miRNAs. It's harder to determine which samples should be filtered before prediction, except where sequencing was obviously poor.

mschilli87 commented 2 years ago

You're welcome. Thanks for reporting your issue and following up.

Just for future reference, I used perl 5.34.0 for my test.

You are right that the -c option should have the same result as running collapse_reads_md.pl on the output FASTA. However, my question was why you did not collapse sequences before mapping. Somehow you must have produced your BAM files from FASTQ (or FASTA) files to begin with. Collapsing identical sequences at that stage would speed up all downstream steps and most likely avoid this issue without the need to modify any miRDeep2 script.

I understand that the data you provided was just a subset. I'll try to express myself more clearly: when mapping your (full) raw read input to obtain the (full) BAM file, you basically asked the mapper (I assume bwa) the same question (amongst others) 59,739,030 times: 'Where does the sequence ATCGATGATGATTTCCAATAAAACATTCCTTGGAAAGCTGAACAAAATGAG map to?'. By asking it only once (by collapsing identical reads pre-mapping), you'd have saved yourself some (waiting) time, and some electricity for your (employer's/funder's) wallet and the environment, while getting essentially the same information. Additionally, your BAM/SAM file would be much smaller, and collapsing at the BAM-to-FASTA conversion stage would no longer even be required.

I agree that your workflow should work (and the -c flag exists specifically to support it, e.g. when downloading already-mapped data that had not been collapsed beforehand, rather than starting from raw reads), but since you seem to analyze a lot of sRNA-seq data with many tools, early collapsing would probably benefit you in more ways than just working around a bug in an older version of Perl triggered by an implementation detail of miRDeep2.

Just to clarify: I am not suggesting to exclude respective reads from the analysis, but rather collapsing all their instances into a single one while keeping track of the initial count of occurrences.

I'll leave it to @Drmirdeep to decide whether fixing this corner case is worth the risk of potentially introducing another bug. :wink:

smahaffey commented 2 years ago

Thank you so much. Yes, you have a valid point about saving computation time, storage, electricity, and money. The problem is that while miRDeep2 is designed thoughtfully, other software does not handle read counts the same way (reading them from the FASTA ID line), or in any manner other than counting the reads in the BAM/SAM file. That makes it difficult to align only collapsed reads and then use the collapsed-read alignment in other software where read depth is an important component of the algorithm: many applications would see the read depth as 1, or something vastly underestimated, when given a BAM/SAM file of collapsed reads. Ideally this idea would be incorporated into a new SAM format specification, so that new aligners could output collapsed alignments in a standardized way that other software could be modified or designed to use.
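For context, the count carried by a collapsed ID can be recovered with a one-line pattern match. A minimal sketch (read_count_from_id is a hypothetical helper, not part of miRDeep2) of why tools unaware of the _x&lt;count&gt; convention see a depth of 1:

```perl
# Sketch: recover the read count from a miRDeep2-style collapsed FASTA ID
# such as seq_1_x59739030. A tool that ignores the _x<count> suffix would
# treat such a record as a single read.
use strict;
use warnings;

sub read_count_from_id {
    my ($id) = @_;
    return $id =~ /_x(\d+)$/ ? $1 : 1;   # plain, uncollapsed IDs count as 1
}

print read_count_from_id('seq_1_x59739030'), "\n";   # 59739030
```

A downstream tool would need exactly this kind of lookup at its counting step to use collapsed alignments correctly, which is why a standardized convention would help.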

Again, thank you both (@mschilli87 and @Drmirdeep) for your help and for continuing to maintain this software. I know it can be a time-consuming and thankless job.

Drmirdeep commented 1 year ago

This won't be fixed by me. I consider the script deprecated!