multi - mapper warning - Githubissues

itmat / rum

RNA-Seq Unified Mapper

http://cbil.upenn.edu/RUM

MIT License

26 stars 4 forks source link

multi - mapper warning #120

Closed khayer closed 12 years ago

khayer commented 12 years ago

RUM-Pipeline-v2.0.2_02 just finished on a few of my samples, but two of them had this warning in the postprocess:

2012/09/05 08:49:10 WARN  - Looks like there's
2012/09/05 08:49:10 WARN  - a multi-mapper in the RUM_Unique file. 4971221 () seq.4971221a  chr2    178082532-178082573, 178083765-178083822    +   GAGGATATGATGGTTACAATGAAGGAGGAAATTTTGGCGGTG:GTAACTATGGTGGTGGTGGGAACTATAATGATTTTGGAAATTATAGTGGACAACAGCA
2012/09/05 08:49:10 WARN  - Looks like there's
2012/09/05 08:49:10 WARN  - a multi-mapper in the RUM_Unique file. 4971221 () seq.4971221b  chr2    178083875-178083887, 178083975-178084042, 178084136-178084155   +   AGTCCCTATGGTG:GTGGTTATGGATCTGGTGGTGGAAGTGGTGGATATGGTAGCAGAAGGTTCTAAAAACAGCAGAAAAGG:GCTACAGTTCTTAGCAAGAG
2012/09/05 08:54:42 WARN  - Looks like there's
2012/09/05 08:54:42 WARN  - a multi-mapper in the RUM_Unique file. 19922842 () seq.19922842a    chr17   76411033-76411108, 76415627-76415650    +   GTCAGTTCACCGGGACCTGGAGGCCCAGATTGCGATCGTGACGGAGAACCAGGCCCTGCAGCAGCAGCTTCACCAG:GAGCAAGAGCAGCTCTACCTGAGG
2012/09/05 08:54:42 WARN  - Looks like there's
2012/09/05 08:54:42 WARN  - a multi-mapper in the RUM_Unique file. 19922842 () seq.19922842b    chr17   76415684-76415756, 76420142-76420167    +   CCGAGTCGCCAGGTGAAGCTGTGGGTGAAGATGGTGACTCCACTGATCAAGAACTTCTTCTGAGGACAGACAG:GAATGGCCTTGATGAAGATGACAGGC

mdelaurentis commented 12 years ago

These do not actually seem to be non-unique mappers. I just looked at the results from a job Greg ran that had the same problem. In that case, there were five read pairs that had the issue. None of them were actually multi-mappers; they were all just printed twice in the RUM_Unique file. So the issue is not that there are multi-mappers in the RUM_Unique file, but that a handful of unique mappers are being printed twice. I ran a job that had just these reads in it, and could not reproduce the problem. The reads were only printed once. So I'm still trying to reproduce the issue so I can figure out what's causing it.

Here are the relevant mappings from Greg's job:

seq.216848632a  chr2    73747921-73747998, 73749012-73749033    +       CTCTCCAGTCCTAGTCTCTGGCCGAGATAACACTGATGCAGAAATTGGTCTATATGCAACTCTGGATCCAGCTCGGAT:CAGAGCGGGGGTGCGGGCGAGC
seq.216848632a  chr2    73747921-73747998, 73749012-73749033    +       CTCTCCAGTCCTAGTCTCTGGCCGAGATAACACTGATGCAGAAATTGGTCTATATGCAACTCTGGATCCAGCTCGGAT:CAGAGCGGGGGTGCGGGCGAGC
seq.216848632b  chr2    73749066-73749117, 73749299-73749347    +       CGCGGCGTGGATCTAGGTGACAGGCGACGTGGGCTCCTCTCCCGCTTCCTCT:CTTCGCTTACCTCCCGAGGCGGCGGCAGTGGCGGCGGCAGAGGTCGAAG
seq.220849398a  chr5    4922127-4922175, 4949532-4949582        -       TTATTCCATGCTTGTCTAAGGCTTTTAGAGCTGTACACGGTAAAGCGTT:CTGGCTTAAAATGTGGTAAAGAATGAATTCCAGGCCACGTGTCCTCATTCG
seq.220849398b  chr5    4806617-4806717 -       CGGGCAGAAGAGAACCGATTCAGGGCCCTCGCACTGAGCCACTTCATTAGTGGTTTTTATTGATTGATTTATTTTTATCTCTTCTTCTTCTTCCCAGAATG
seq.220849398a  chr5    4922127-4922175, 4949532-4949582        -       TTATTCCATGCTTGTCTAAGGCTTTTAGAGCTGTACACGGTAAAGCGTT:CTGGCTTAAAATGTGGTAAAGAATGAATTCCAGGCCACGTGTCCTCATTCG
seq.221243188a  chr5    121656769-121656819, 121657156-121657204        -       ATCACTCCCGGAACTGTCCTGATCATCCTCACTGGGCGCCACAGGGGCAAG:AGAGTGGTTTTCCTGAAGCAGCTGGACAGTGGCTTGCTGCTTGTGACTG
seq.221243188b  chr5    121655834-121655864, 121656676-121656745        -       TGGCACCCGGGTGGTGAAGCTTCGAAAAATG:CCTAGGTATTATCCCACCGAAGATGTTCCTCGGAAGCTGCTGAGCCACGGCAAGAAGCCCTTCAGCCAGC
seq.221243188a  chr5    121656769-121656819, 121657156-121657204        -       ATCACTCCCGGAACTGTCCTGATCATCCTCACTGGGCGCCACAGGGGCAAG:AGAGTGGTTTTCCTGAAGCAGCTGGACAGTGGCTTGCTGCTTGTGACTG
seq.265992989a  chr13   97379880-97379886, 97382216-97382308    +       CGATAGG:GAAGTGGAAAAGAGGAGACGAGTGGAGGAAGCGTACAAGAATGTGATGGAAGAACTTAAGAAGAAACCCCGTTTCGGAGGGCCGGATTATGAA
seq.265992989a  chr13   97379880-97379886, 97382216-97382308    +       CGATAGG:GAAGTGGAAAAGAGGAGACGAGTGGAGGAAGCGTACAAGAATGTGATGGAAGAACTTAAGAAGAAACCCCGTTTCGGAGGGCCGGATTATGAA
seq.265992989b  chr13   97384758-97384815, 97386412-97386454    +       AGAGTTCTTTGATGCTGTTGAAGCTGCTCTTGACAGACAAGATAAAATAGAGGAACAG:TCACAGAGTGAAAAGGTCAGGTTACACTGGCCCACATCATTGC
seq.241861100a  chrM    1220-1320       +       GGAAAGATGAAAGACTAATTAAAAGTAAGAACAAGCAAAGATTAAACCTTGTACCTTTTGCATAATGAACTAACTAGAAAACTTCTAACTAAAAGAATTAC
seq.241861100b  chrM    1457-1556       +       AAATGAATTTAAGTTCAATTTTAAACTTGCTAAAAAAACAACAAAATCAAAAAGTAAGTTTAGATTATAGCCAAAAGAGGGACAGCTCTTCTGGAACGGA
seq.241861100b  chrM    1457-1556       +       AAATGAATTTAAGTTCAATTTTAAACTTGCTAAAAAAACAACAAAATCAAAAAGTAAGTTTAGATTATAGCCAAAAGAGGGACAGCTCTTCTGGAACGGA

delagoya commented 12 years ago

Perhaps this is a problem that arises upstream. E.g. you are splitting the files and copying a read into two chunks, then the merge script puts in duplicate lines.

Just a thought. -angel On Sep 10, 2012, at 11:07 AM, Mike DeLaurentis wrote:

These do not actually seem to be non-unique mappers. I just looked at the results from a job Greg ran that had the same problem. In that case, there were five read pairs that had the issue. None of them were actually multi-mappers; they were all just printed twice in the RUM_Unique file. So the issue is not that there are multi-mappers in the RUM_Unique file, but that a handful of unique mappers are being printed twice. I ran a job that had just these reads in it, and could not reproduce the problem. The reads were only printed once. So I'm still trying to reproduce the issue so I can figure out what's causing it.

Here are the relevant mappings from Greg's job:

seq.216848632a chr2 73747921-73747998, 73749012-73749033 + CTCTCCAGTCCTAGTCTCTGGCCGAGATAACACTGATGCAGAAATTGGTCTATATGCAACTCTGGATCCAGCTCGGAT:CAGAGCGGGGGTGCGGGCGAGC seq.216848632a chr2 73747921-73747998, 73749012-73749033 + CTCTCCAGTCCTAGTCTCTGGCCGAGATAACACTGATGCAGAAATTGGTCTATATGCAACTCTGGATCCAGCTCGGAT:CAGAGCGGGGGTGCGGGCGAGC seq.216848632b chr2 73749066-73749117, 73749299-73749347 + CGCGGCGTGGATCTAGGTGACAGGCGACGTGGGCTCCTCTCCCGCTTCCTCT:CTTCGCTTACCTCCCGAGGCGGCGGCAGTGGCGGCGGCAGAGGTCGAAG seq.220849398a chr5 4922127-4922175, 4949532-4949582 - TTATTCCATGCTTGTCTAAGGCTTTTAGAGCTGTACACGGTAAAGCGTT:CTGGCTTAAAATGTGGTAAAGAATGAATTCCAGGCCACGTGTCCTCATTCG seq.220849398b chr5 4806617-4806717 - CGGGCAGAAGAGAACCGATTCAGGGCCCTCGCACTGAGCCACTTCATTAGTGGTTTTTATTGATTGATTTATTTTTATCTCTTCTTCTTCTTCCCAGAATG seq.220849398a chr5 4922127-4922175, 4949532-4949582 - TTATTCCATGCTTGTCTAAGGCTTTTAGAGCTGTACACGGTAAAGCGTT:CTGGCTTAAAATGTGGTAAAGAATGAATTCCAGGCCACGTGTCCTCATTCG seq.221243188a chr5 121656769-121656819, 121657156-121657204 - ATCACTCCCGGAACTGTCCTGATCATCCTCACTGGGCGCCACAGGGGCAAG:AGAGTGGTTTTCCTGAAGCAGCTGGACAGTGGCTTGCTGCTTGTGACTG seq.221243188b chr5 121655834-121655864, 121656676-121656745 - TGGCACCCGGGTGGTGAAGCTTCGAAAAATG:CCTAGGTATTATCCCACCGAAGATGTTCCTCGGAAGCTGCTGAGCCACGGCAAGAAGCCCTTCAGCCAGC seq.221243188a chr5 121656769-121656819, 121657156-121657204 - ATCACTCCCGGAACTGTCCTGATCATCCTCACTGGGCGCCACAGGGGCAAG:AGAGTGGTTTTCCTGAAGCAGCTGGACAGTGGCTTGCTGCTTGTGACTG seq.265992989a chr13 97379880-97379886, 97382216-97382308 + CGATAGG:GAAGTGGAAAAGAGGAGACGAGTGGAGGAAGCGTACAAGAATGTGATGGAAGAACTTAAGAAGAAACCCCGTTTCGGAGGGCCGGATTATGAA seq.265992989a chr13 97379880-97379886, 97382216-97382308 + CGATAGG:GAAGTGGAAAAGAGGAGACGAGTGGAGGAAGCGTACAAGAATGTGATGGAAGAACTTAAGAAGAAACCCCGTTTCGGAGGGCCGGATTATGAA seq.265992989b chr13 97384758-97384815, 97386412-97386454 + AGAGTTCTTTGATGCTGTTGAAGCTGCTCTTGACAGACAAGATAAAATAGAGGAACAG:TCACAGAGTGAAAAGGTCAGGTTACACTGGCCCACATCATTGC seq.241861100a chrM 1220-1320 + GGAAAGATGAAAGACTAATTAAAAGTAAGAACAAGCAAAGATTAAACCTTGTACCTTTTGCATAATGAACTAACTAGAAAACTTCTAACTAAAAGAATTAC seq.241861100b chrM 1457-1556 + AAATGAATTTAAGTTCAATTTTAAACTTGCTAAAAAAACAACAAAATCAAAAAGTAAGTTTAGATTATAGCCAAAAGAGGGACAGCTCTTCTGGAACGGA seq.241861100b chrM 1457-1556 + AAATGAATTTAAGTTCAATTTTAAACTTGCTAAAAAAACAACAAAATCAAAAAGTAAGTTTAGATTATAGCCAAAAGAGGGACAGCTCTTCTGGAACGGA — Reply to this email directly or view it on GitHub.

mdelaurentis commented 12 years ago

Thanks for the idea. That's a good thought, but it doesn't seem to be the case. The mappings are also duplicated in the individual chunk output files (e.g. RUM_Unique.28). I tried running a job that contained just the offending reads, and they were only printed once. So now I'm running one chunk from Greg's job by itself.

greggrant commented 12 years ago

I believe RUM final cleanup removes duplciated reads from RUM_NU, maybe we just need to run that as well on RUM_Unique? Even if that does fix it, it would be good to know how this issue got introduced, if we can figure it out. Thanks, Greg

On Mon, 10 Sep 2012, Angel Pizarro wrote:

Perhaps this is a problem that arises upstream. E.g. you are splitting the files and copying a read into two chunks, then the merge script puts in duplicate lines.

Just a thought. -angel On Sep 10, 2012, at 11:07 AM, Mike DeLaurentis wrote:

These do not actually seem to be non-unique mappers. I just looked at the results from a job Greg ran that had the same problem. In that case, there were five read pairs that had the issue. None of them were actually multi-mappers; they were all just printed twice in the RUM_Unique file. So the issue is not that there are multi-mappers in the RUM_Unique file, but that a handful of unique mappers are being printed twice. I ran a job that had just these reads in it, and could not reproduce the problem. The reads were only printed once. So I'm still trying to reproduce the issue so I can figure out what's causing it.

Here are the relevant mappings from Greg's job:

seq.216848632a chr2 73747921-73747998, 73749012-73749033 + CTCTCCAGTCCTAGTCTCTGGCCGAGATAACACTGATGCAGAAATTGGTCTATATGCAACTCTGGATCCAGCTCGGAT:CAGAGCGGGGGTGCGGGCGAGC seq.216848632a chr2 73747921-73747998, 73749012-73749033 + CTCTCCAGTCCTAGTCTCTGGCCGAGATAACACTGATGCAGAAATTGGTCTATATGCAACTCTGGATCCAGCTCGGAT:CAGAGCGGGGGTGCGGGCGAGC seq.216848632b chr2 73749066-73749117, 73749299-73749347 + CGCGGCGTGGATCTAGGTGACAGGCGACGTGGGCTCCTCTCCCGCTTCCTCT:CTTCGCTTACCTCCCGAGGCGGCGGCAGTGGCGGCGGCAGAGGTCGAAG seq.220849398a chr5 4922127-4922175, 4949532-4949582 - TTATTCCATGCTTGTCTAAGGCTTTTAGAGCTGTACACGGTAAAGCGTT:CTGGCTTAAAATGTGGTAAAGAATGAATTCCAGGCCACGTGTCCTCATTCG seq.220849398b chr5 4806617-4806717 - CGGGCAGAAGAGAACCGATTCAGGGCCCTCGCACTGAGCCACTTCATTAGTGGTTTTTATTGATTGATTTATTTTTATCTCTTCTTCTTCTTCCCAGAATG seq.220849398a chr5 4922127-4922175, 4949532-4949582 - TTATTCCATGCTTGTCTAAGGCTTTTAGAGCTGTACACGGTAAAGCGTT:CTGGCTTAAAATGTGGTAAAGAATGAATTCCAGGCCACGTGTCCTCATTCG seq.221243188a chr5 121656769-121656819, 121657156-121657204 - ATCACTCCCGGAACTGTCCTGATCATCCTCACTGGGCGCCACAGGGGCAAG:AGAGTGGTTTTCCTGAAGCAGCTGGACAGTGGCTTGCTGCTTGTGACTG seq.221243188b chr5 121655834-121655864, 121656676-121656745 - TGGCACCCGGGTGGTGAAGCTTCGAAAAATG:CCTAGGTATTATCCCACCGAAGATGTTCCTCGGAAGCTGCTGAGCCACGGCAAGAAGCCCTTCAGCCAGC seq.221243188a chr5 121656769-121656819, 121657156-121657204 - ATCACTCCCGGAACTGTCCTGATCATCCTCACTGGGCGCCACAGGGGCAAG:AGAGTGGTTTTCCTGAAGCAGCTGGACAGTGGCTTGCTGCTTGTGACTG seq.265992989a chr13 97379880-97379886, 97382216-97382308 + CGATAGG:GAAGTGGAAAAGAGGAGACGAGTGGAGGAAGCGTACAAGAATGTGATGGAAGAACTTAAGAAGAAACCCCGTTTCGGAGGGCCGGATTATGAA seq.265992989a chr13 97379880-97379886, 97382216-97382308 + CGATAGG:GAAGTGGAAAAGAGGAGACGAGTGGAGGAAGCGTACAAGAATGTGATGGAAGAACTTAAGAAGAAACCCCGTTTCGGAGGGCCGGATTATGAA seq.265992989b chr13 97384758-97384815, 97386412-97386454 + AGAGTTCTTTGATGCTGTTGAAGCTGCTCTTGACAGACAAGATAAAATAGAGGAACAG:TCACAGAGTGAAAAGGTCAGGTTACACTGGCCCACATCATTGC seq.241861100a chrM 1220-1320 + GGAAAGATGAAAGACTAATTAAAAGTAAGAACAAGCAAAGATTAAACCTTGTACCTTTTGCATAATGAACTAACTAGAAAACTTCTAACTAAAAGAATTAC seq.241861100b chrM 1457-1556 + AAATGAATTTAAGTTCAATTTTAAACTTGCTAAAAAAACAACAAAATCAAAAAGTAAGTTTAGATTATAGCCAAAAGAGGGACAGCTCTTCTGGAACGGA seq.241861100b chrM 1457-1556 + AAATGAATTTAAGTTCAATTTTAAACTTGCTAAAAAAACAACAAAATCAAAAAGTAAGTTTAGATTATAGCCAAAAGAGGGACAGCTCTTCTGGAACGGA — Reply to this email directly or view it on GitHub.

Reply to this email directly or view it on GitHub: https://github.com/PGFI/rum/issues/120#issuecomment-8426341

mdelaurentis commented 12 years ago

I've figured out what the problem is. It was introduced in 2.0.2_02, it's completely my fault, and unfortunately I think Greg and Katharina should probably rerun their jobs.

When I split "rum_runner align" and "rum_runner resume" into two separate actions, I changed the way we parsed the command line options, in order to not duplicate all the command-line parsing code, since those scripts take many of the same options. One of the things the command line parsing code used to do is set a flag indicating that we're dealing with paired-end data if we get two input files. That flag wasn't being properly set in the new code. Some of the individual scripts depend on that flag in order to determine whether they are dealing with paired-end data.

The duplicate mappings were being printed by make_TU_and_TNU, the script that parses the Bowtie transcriptome mapping output. There's some code in the script that handles single-end reads based on the --single or --paired flag it gets on the command line, and there's some other code that handles paired-end reads based on whether the current read has a 'b' in the name. So what was happening was that both bits of code were running. It thought it was dealing with single-end reads because it was getting the --single flag on the command line, but then it was seeing reverse reads in the input, so it was handling a small number of reads twice.

I apologize for the error and the waste of resources. I have plenty of tests that run the individual scripts, and some tests that run the components of RUM_runner, but unfortunately I didn't have a test that ensured that if "rum_runner align" gets two files on the command line, the --paired flag gets passed to the scripts. So this case just fell through the cracks.

The fix should be pretty simple, but I would like to add some tests that will make sure I don't miss this kind of thing in the future. I should be able to have another release together tomorrow or Wednesday morning; before the cluster comes back up.

greggrant commented 12 years ago

Great sleuthing Mike! Thanks for the detailed explanation, that makes sense. No problem waiting a day or two for the fix, but can we put up some warning on the downlaods page so people don't download this broken version before we can fix it? Thanks, Greg

On Mon, 10 Sep 2012, Mike DeLaurentis wrote:

I've figured out what the problem is. It was introduced in 2.0.2_02, it's completely my fault, and unfortunately I think Greg and Katharina should probably rerun their jobs.

When I split "rum_runner align" and "rum_runner resume" into two separate actions, I changed the way we parsed the command line options, in order to not duplicate all the command-line parsing code, since those scripts take many of the same options. One of the things the command line parsing code used to do is set a flag indicating that we're dealing with paired-end data if we get two input files. That flag wasn't being properly set in the new code. Some of the individual scripts depend on that flag in order to determine whether they are dealing with paired-end data.

The duplicate mappings were being printed by make_TU_and_TNU, the script that parses the Bowtie transcriptome mapping output. There's some code in the script that handles single-end reads based on the --single or --paired flag it gets on the command line, and there's some other code that handles paired-end reads based on whether the current read has a 'b' in the name. So what was happening was that both bits of code were running. It thought it was dealing with single-end reads because it was getting the --single flag on the command line, but then it was seeing reverse reads in the input, so it was handling a small number of reads twice.

I apologize for the error and the waste of resources. I have plenty of tests that run the individual scripts, and some tests that run the components of RUM_runner, but unfortunately I didn't have a test that ensured that if "rum_runner align" gets two files on the command line, the --paired flag gets passed to the scripts. So this case just fell through the cracks.

The fix should be pretty simple, but I would like to add some tests that will make sure I don't miss this kind of thing in the future. I should be able to have another release together tomorrow or Wednesday morning; before the cluster comes back up.

Reply to this email directly or view it on GitHub: https://github.com/PGFI/rum/issues/120#issuecomment-8432253

mdelaurentis commented 12 years ago

I'll just delete this version from the downloads page until I get the fixed version together. Unfortunately I don't think I can really customize the text on the download page.

On Mon, Sep 10, 2012 at 2:18 PM, greggrant notifications@github.com wrote:

Great sleuthing Mike! Thanks for the detailed explanation, that makes sense. No problem waiting a day or two for the fix, but can we put up some warning on the downlaods page so people don't download this broken version before we can fix it? Thanks, Greg

On Mon, 10 Sep 2012, Mike DeLaurentis wrote:

I've figured out what the problem is. It was introduced in 2.0.2_02, it's completely my fault, and unfortunately I think Greg and Katharina should probably rerun their jobs.

When I split "rum_runner align" and "rum_runner resume" into two separate actions, I changed the way we parsed the command line options, in order to not duplicate all the command-line parsing code, since those scripts take many of the same options. One of the things the command line parsing code used to do is set a flag indicating that we're dealing with paired-end data if we get two input files. That flag wasn't being properly set in the new code. Some of the individual scripts depend on that flag in order to determine whether they are dealing with paired-end data.

The duplicate mappings were being printed by make_TU_and_TNU, the script that parses the Bowtie transcriptome mapping output. There's some code in the script that handles single-end reads based on the --single or --paired flag it gets on the command line, and there's some other code that handles paired-end reads based on whether the current read has a 'b' in the name. So what was happening was that both bits of code were running. It thought it was dealing with single-end reads because it was getting the --single flag on the command line, but then it was seeing reverse reads in the input, so it was handling a small number of reads twice.

I apologize for the error and the waste of resources. I have plenty of tests that run the individual scripts, and some tests that run the components of RUM_runner, but unfortunately I didn't have a test that ensured that if "rum_runner align" gets two files on the command line, the --paired flag gets passed to the scripts. So this case just fell through the cracks.

The fix should be pretty simple, but I would like to add some tests that will make sure I don't miss this kind of thing in the future. I should be able to have another release together tomorrow or Wednesday morning; before the cluster comes back up.

Reply to this email directly or view it on GitHub: https://github.com/PGFI/rum/issues/120#issuecomment-8432253

— Reply to this email directly or view it on GitHubhttps://github.com/PGFI/rum/issues/120#issuecomment-8432587.