nf-core / fastquorum

Pipeline to produce consensus reads using unique molecular indexes/barcodes (UMIs)
https://nf-co.re/fastquorum
MIT License

GroupReadsByUmi failing on one sample #53

Closed SPPearce closed 2 months ago

SPPearce commented 4 months ago

Description of the bug

This may be related to #52, but posting it separately as I'm not sure.

I’m seeing this error from GroupReadsByUmi on one of my 8 duplex samples:

  [2024/05/28 06:12:25 | FgBioMain | Info] GroupReadsByUmi failed. Elapsed time: 0.06 minutes.
  Exception in thread "main" java.lang.IllegalStateException: A01659:139:HT77KDRX3:1:2160:16260:22326 did not have a primary R1 record.

which is odd to me, because that bam file contains two reads named A01659:139:HT77KDRX3:1:2160:16260:22326:

A01659:139:HT77KDRX3:1:2160:16260:22326 163 chr1    10034   60  9M1D103M1D25M7S =   10034   139 CCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACAGTACGG    FFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF:FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF:FFFFF:FF:FFFFF:FFFFFFFF:FFFFF:FF::F:FFFFF,FF:F::F,,:F,,FF,,:,,,,    XA:Z:chr3,+10442,9M1D5M1D43M1I71M15S,4;chr4,-190122667,11S35M4I20M1D74M,7;chr1,-248946041,7S16M1D77M1D12M1D32M,7;   MC:Z:7S9M1D103M1D25M    MD:Z:9^T103^C25 NM:i:2  MQ:i:60 AS:i:123    XS:i:102    RX:Z:CAGTA-AATGC    RG:Z:A
A01659:139:HT77KDRX3:2:2214:9299:21261  163 chr1    16440   41  144M    =   16440   144 TCTACAGTTTGAAAACCACTATTTTATGAACCAAGTAGAACAAGATATTTGAAATCGAAACTATTCAAAAAATTGAGAATTTCTGACCACTTAACAAACCCACAGAAAATCCACCCGAGTGCACTGAGCACGCCAGAAATCAGG    FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF    XA:Z:chr9,+16551,144M,2;chr16,+16123,144M,2;chr2,-113596853,144M,2;chr15,-101974377,144M,2;chr12,+16555,144M,2;chrX,-156023484,144M,3;chr1,+186962,144M,4;chr12_GL877875v1_alt,+6555,144M,2;    MC:Z:144M   MD:Z:55G88  NM:i:1  MQ:i:41 AS:i:139    XS:i:134    RX:Z:TGTGC-AAGGA    RG:Z:A-6B738825
A01659:139:HT77KDRX3:1:2160:16260:22326 83  chr1    10034   60  7S9M1D103M1D25M =   10034   -139    TATGCCTCCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC    ,:F:,,FFFFFFFFF,FF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF    XA:Z:chr3,-10508,7S137M,7;chr4,+190122662,4M1D35M4I20M1D74M7S,8;chr3,-10458,21S43M1I71M8S,2;chr1,+248946041,16M1D77M1D12M1D26M13S,5;    MC:Z:9M1D103M1D25M7S    MD:Z:9^T103^C25 NM:i:2  MQ:i:60 AS:i:123    XS:i:105    RX:Z:CAGTA-AATGC    RG:Z:A
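
For anyone reading the excerpt above, here is a minimal Python sketch that decodes the FLAG column and checks whether records sharing a read name sit next to each other. The flag bits come from the SAM specification; the helper names (`describe_flag`, `split_templates`) are made up for illustration, and the idea that GroupReadsByUmi needs a template's records to be adjacent is an assumption about the tool, not something confirmed in this thread.

```python
# SAM flag bits, as defined in the SAM specification.
FIRST_IN_PAIR = 0x40
SECOND_IN_PAIR = 0x80
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def describe_flag(flag):
    """Return (read_label, is_primary) for a SAM FLAG value."""
    if flag & FIRST_IN_PAIR:
        label = "R1"
    elif flag & SECOND_IN_PAIR:
        label = "R2"
    else:
        label = "unpaired"
    # A record is primary when it is neither secondary nor supplementary.
    is_primary = not (flag & (SECONDARY | SUPPLEMENTARY))
    return label, is_primary


def split_templates(names):
    """Return read names whose records do not form one contiguous run.

    Hypothetical helper: assumes the input is the list of QNAMEs in file order.
    """
    seen = set()
    split = []
    previous = None
    for name in names:
        if name != previous:
            # Starting a new run: a name we have already seen was interrupted.
            if name in seen and name not in split:
                split.append(name)
            seen.add(name)
        previous = name
    return split


# Read names in the order they appear in the excerpt above.
names = [
    "A01659:139:HT77KDRX3:1:2160:16260:22326",
    "A01659:139:HT77KDRX3:2:2214:9299:21261",
    "A01659:139:HT77KDRX3:1:2160:16260:22326",
]
print(describe_flag(163))  # first record of the failing read
print(describe_flag(83))   # third record, same read name
print(split_templates(names))
```

Under these definitions, flag 163 is a primary R2 and flag 83 is a primary R1, so the R1 record does exist in the file. In the excerpt it is separated from its mate by a record from a different template.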

All 8 of these samples were sequenced over two lanes, so the per-lane bam files are merged together. Curiously, this is the only file that fails in this way; the other 7 samples are fine.

If I manually sort the merged bam file, then GroupReadsByUmi re-sorts the bam file itself and works correctly.

Command used and terminal output

No response

Relevant files

No response

System information

Running fastquorum v1.0.0 on Nextflow 23.10.1 with apptainer as the container engine.

nh13 commented 4 months ago

I think it’s absolutely related. One temporary fix would be to swap in fgbio SortBam for template coordinate merging for now. It isn’t as fast (not multithreaded) but would work when merging lanes. Perhaps even better as a stopgap would be just to use samtools sort, which works, to re-sort after the merge?
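
As a rough illustration of the two stopgaps mentioned here, the Python sketch below only builds the command lines for re-sorting a merged bam file by hand, outside the pipeline. The function names and file names are hypothetical, and it assumes a samtools version that supports `sort --template-coordinate` plus an `fgbio` executable on the PATH; it is not how the pipeline itself invokes these tools.

```python
import shlex


def samtools_resort_command(merged_bam, sorted_bam, threads=1):
    """Build a `samtools sort` command that re-sorts into template-coordinate order.

    Assumes a samtools release that provides the --template-coordinate option.
    """
    return [
        "samtools", "sort",
        "--template-coordinate",
        "-@", str(threads),  # number of additional sorting threads
        "-o", sorted_bam,
        merged_bam,
    ]


def fgbio_resort_command(merged_bam, sorted_bam):
    """Build the equivalent single-threaded `fgbio SortBam` command."""
    return [
        "fgbio", "SortBam",
        "--input", merged_bam,
        "--output", sorted_bam,
        "--sort-order", "TemplateCoordinate",
    ]


# Hypothetical file names, printed rather than executed.
print(shlex.join(samtools_resort_command("merged.bam", "resorted.bam", threads=4)))
print(shlex.join(fgbio_resort_command("merged.bam", "resorted.bam")))
```

Either printed command could then be run manually, or passed to `subprocess.run(..., check=True)`, before handing the re-sorted file to GroupReadsByUmi.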

SPPearce commented 4 months ago

Ok, thanks. Currently samtools merge is being run non-multithreaded anyway, at least the first time (it is labelled process_low, and it uses task.cpus-1 threads).

nh13 commented 4 months ago

I don’t think we have a merge tool in fgbio, so it’ll have to re-sort.

SPPearce commented 4 months ago

Ok. My surprise is that it worked on 7/8 samples. I think there is an issue with using igenomes too, but I'll dig into that on Monday.

nh13 commented 2 months ago

You're right, this is related to #52 and https://github.com/samtools/samtools/pull/2062

lauren-tjoeka commented 2 months ago

Hi, I'm new to nf-core workflows and I'm also encountering this bug.

  [2024/07/31 08:02:03 | FgBioMain | Info] GroupReadsByUmi failed. Elapsed time: 0.05 minutes.
  Exception in thread "main" java.lang.IllegalStateException: A00232:194:H2LGVDSXC:1:1671:4797:18004 did not have a primary R1 record.

Could you elaborate on how to swap in 'fgbio SortBam'? Is this something I can specify in my config file?

I think it’s absolutely related. One temporary fix would be to swap in fgbio SortBam for template coordinate merging for now. It isn’t as fast (not multithreaded) but would work when merging lanes. Perhaps even better as a stopgap would be just to use samtools sort, which works, to re-sort after the merge?

Thanks!

SPPearce commented 2 months ago

This was fixed in the branch that @nh13 had made, but he seems to have deleted it now. The upstream fix is in samtools, but samtools haven't made a release yet. Nils, I think we should release a 1.0.1 version sooner than samtools might actually get round to releasing it.

SPPearce commented 2 months ago

Could you elaborate on how to swap in 'fgbio SortBam'? Is this something I can specify in my config file?

No, you can't do this with a config, it requires an edit to the workflow itself.

nh13 commented 2 months ago

@SPPearce here's the closed PR: https://github.com/nf-core/fastquorum/pull/54. I was hoping that samtools would be released by now, but it's volunteer-run, so I can relate. I've asked for a release here: https://github.com/samtools/samtools/issues/2090. Perhaps we wait a few days and then do a release?

SPPearce commented 2 months ago

Fixed in #68