jsh58 / Genrich

Detecting sites of genomic enrichment
MIT License
182 stars 27 forks source link

Poorly formatted SAM record error #108

Open rotoke opened 7 months ago

rotoke commented 7 months ago

Hello,

I am analysing a series of multifasta files with ecc_finder, and one of them produces the following error at the Genrich step: Error! seq_2125767: poorly formatted SAM/BAM record (I can reproduce the error by running Genrich -t tmp.sam -o sample.Peak -v on the sorted SAM file)

I managed to trace down the error to two alignments within the SAM file, which come from the longest (94247 bp) and second longest (77851 bp) fasta read respectively. However, I can't see any obvious problem with these entries:

SAM header:

@HD     VN:1.6  SO:queryname
@SQ     SN:chr1A        LN:602900890
@SQ     SN:chr1B        LN:697493198
@SQ     SN:chr1D        LN:490518203
@SQ     SN:chr2A        LN:784661008
@SQ     SN:chr2B        LN:810500911
@SQ     SN:chr2D        LN:655314739
@SQ     SN:chr3A        LN:752710991
@SQ     SN:chr3B        LN:890847171
@SQ     SN:chr3D        LN:621781073
@SQ     SN:chr4A        LN:743084022
@SQ     SN:chr4B        LN:676741658
@SQ     SN:chr4D        LN:509452426
@SQ     SN:chr5A        LN:710124532
@SQ     SN:chr5B        LN:480767623
@SQ     SN:chr5D        LN:578021311
@SQ     SN:chr6A        LN:620140791
@SQ     SN:chr6B        LN:716573881
@SQ     SN:chr6D        LN:476726550
@SQ     SN:chr7A        LN:756324664
@SQ     SN:chr7B        LN:977471539
@SQ     SN:chr7D        LN:642207261
@SQ     SN:chrUn        LN:464691063
@PG     ID:minimap2     PN:minimap2     VN:2.22-r1101   CL:minimap2 -ax map-ont -t 36 ref.mmi tmp.fa
@PG     ID:samtools     PN:samtools     PP:minimap2     VN:1.11 CL:samtools sort -n -@ 36 -o BC29.tmp.sorted.sam BC29.tmp.sam
@PG     ID:samtools.1   PN:samtools     PP:samtools     VN:1.11 CL:samtools view -H -o header.txt BC29.tmp.sorted.sam

Head of malformatted alignment:

seq_2125764     16      chr3D   448389107       60      49766S22M1I138M3I68M1D57M1I9M1D104M1D14M1D14M1D80M1I33M3I12M1D22M2I164M1I43M1I14M1D2M1D20M1D30M1D118M1D2M2D4M1I48M1D139M3D29M1D39M1I9M1D23M2D30M2I17M1I13M2D214M1D103M1D35M1I3M1D88M2D44M2D111M2I28M1I28M1I36M2D81M1I33M2D133M1I72M1I49M1D38M1I74M1D161M1I84M1I2M2D6M1I4M1D135M1D13M3I41M2D58M1I15M1D211M2I12M2D29M1D52M2D35M3D5M1D39M2I50M1I44M1I16M1I71M1D42M7D10M1D14M1I18M1D19M1D3M1I4M1D7M1D1M2D86M1D11M3I91M1D11M1D25M3D62M1I96M1D6M2I15M1I2M1D105M1D2M1D17M3D231M1D16M1I44M1I4M1D7M1I94M4I39M1D4M2D54M1D32M1I33M1D25M1D83M1D45M1D7M1I124M1D27M2I3M1I72M1I73M1D1M1D46M1D5M2D17M1I13M1I3M2D48M1D8M1D1M1D2M2I5M1I2M1D43M1I9M1I30M1I35M1D3M1D2M1I54M1D4M1D59M1D8M1I51M1I98M1D3M1D4M2I2M1D5M1I133M1D9M1D37M1I153M3I61M1D20M1I79M2I4M1D2M6D28M1I1M3D29M1I1M2D11M1I14M1I8M1I38M3I29M3D52M3D29M1I30M1D74M4D5M1D39M2I7M1D5M1I1M1I208M1D68M1I4M1D212M3D1M1D21M2I3M1I59M3D27M2D54M2I60M1I148M1D28M2I43M2D3M1D6M1I9M2I121M5D49M1D8M1D4M1D19M4D1M1D7M1D21M4D57M1I17M2D17M1I58M1I6M1D11M1I3M5D28M1D34M1D48M1I18M1D20M1I5M1D5M1D66M1D62M1D44M4D45M3D33M4I1M1I4M3D8M2D27M1I170M1D31M1I6M2D54M1I2M1I41M1I88M1D208M2D66M6D2M1D5M3D4M2D44M1I2M3D38M2D5M2D132M1I3M1D2M1D37M1D9M1I15M1I42M1D13M2D40M2D43M1I4M2I27M1I154M1I37M1D93M1I84M1I56M2D7M1D35M1I91M1D17M1I31M2I9M1I2M1D14M1D3M1D18M2I30M2D5M1I32M1D1M1D184M1D128M1D117M2D3M2D59M1D5M1I1M2D33M1D213M1D297M1D15M32934S       *       0       0  TCCGGGATGAAGCAA[...]

Tail of malformatted alignment:

[...]GCAGAATTAGCAATATAC  *       NM:i:675        ms:i:19402      AS:i:19274      nn:i:0      tp:A:P  cm:i:664        s1:i:4980       s2:i:1648       de:f:0.0467    SA:Z:chr3D,448389684,+,44476S11546M106D38225S,60,896;chr6D,20519259,-,12945S5315M2I75985S,47,158;chr2D,145708270,-,56695S7705M2947D29847S,1,5387;chr2A,570675159,-,75864S4446M8D13937S,28,86;chr3D,424338680,-,61759S4162M26D28326S,56,100;chr6D,20521380,+,72475S3521M17D18251S,17,66;chr4A,226979878,+,64537S2728M2D26982S,60,43;chr7D,360787811,-,7011S2278M33D84958S,60,82;chr2B,65706382,+,18519S2526M7I73195S,60,167;chr4A,429646387,+,57685S1873M129D34689S,60,192;chr7A,295502993,-,1707M18D92540S,52,58;chr4A,429646091,-,33105S1597M134D59545S,60,222;chr4D,427325202,+,88759S1615M27D3873S,60,130;chr3D,448399177,-,36876S1355M34D56016S,60,110;chr7A,295503463,+,91418S1126M4D1703S,51,20;chr1A,494282726,-,11484S1298M4I81461S,60,131;chrUn,354425073,+,654S1645M824D91948S,60,1003;chr7B,753764316,-,88326S900M1169D5021S,60,1198;chr4D,427325676,-,5484S939M3I87821S,60,71;chr1A,427830431,+,3388S1096M41D89763S,1,183;chr2B,397340602,+,90617S673M1D2957S,4,9;chr6A,399076249,+,5565S548M1D88134S,11,4;chrUn,354425073,+,2299S544M3D91404S,9,16;chr6A,95900292,+,63716S486M3D30045S,26,8;chr2A,195924772,+,67430S410M1I26406S,13,5;chr4B,24305486,+,87403S1018M198D5826S,1,406;chr5D,453720152,+,69841S298M3D24108S,31,7;chr5D,453720152,+,70733S297M4D23217S,7,11;chr5D,453720152,+,70139S298M3D23810S,1,13;chr5D,453720152,+,70437S296M5D23514S,12,12;chr1D,397361541,-,22917S299M1D71031S,4,14;chr5D,453720152,+,71330S297M5I22615S,59,18;chr2B,355970564,-,31726S202M62319S,60,0;chr2B,355970564,-,31523S202M62522S,60,0;chr2B,355970562,-,31316S204M2I62725S,60,2;chr2B,355970562,-,32131S203M1D61913S,6,1;chr2B,355970562,-,31928S203M1D62116S,60,1;chr2B,355970564,-,30709S202M1I63335S,60,1;chr2B,355970562,-,32334S202M2D61711S,60,2;chr2B,355970562,-,32536S203M1D61508S,60,2;chr6B,142231283,+,71967S194M1D22086S,29,2;chr2B,355970564,-,31114S202M62931S,27,7;chr7B,194962867,+,68209S176M25862S,60,0;chr2B,355970564,-,30913S200M2D63134S,60,12;chr6B,142231283,+,72161S187M1D21899S,36,6;chr2B,355970562,-,32739S171M1D61337S,58,3;chr7B,194962961,+,68127S82M26038S,42,0;chr4B,353242954,+,71903S65M22279S,1,2;  rl:i:21781

Do you have any idea what could have gone wrong here?

Thank you very much, Roman

rotoke commented 6 months ago

If it helps anyone: the alignments producing errors were longer than the default 65520 bp hard-coded in Genrich.h. Increasing this value and recompiling solved the issue. I'd still be interested whether this limit has been set for a particular reason?

Thanks, Roman

jsh58 commented 4 months ago

Great to hear that you diagnosed and solved the problem! I don't recall setting that limit for any particular reason, except that it is far larger than a typical SAM record for a short read. I didn't anticipate long reads being used in genome enrichment assays.