milkschen / leviosam2

Fast and accurate coordinate conversion between assemblies
MIT License
107 stars 17 forks source link

Picard ValidateSamFile halts for 100 `Mate CIGAR String (MC Attribute) present for a read whose mate is unmapped` Error after lifting from chm13v2 to hg38 #71

Open yunkaig opened 6 months ago

yunkaig commented 6 months ago

Hi there,

I have been trying to adapt leviosam2 to out pipeline. This is the process I ran through:

  1. bwa mem alignment to chm13v2
  2. mark duplicates
  3. sort bam & validate bam file was OK then
  4. lift over to hg38 with leviosam2
  5. bqsr & validate bam file throws the error and stops

These are the parameters I used to run leviosam2:

export LVS_SIF="$sw/leviosam2/leviosam2_latest.sif"
export LVS_PY="$sw/leviosam2/leviosam2/workflow/leviosam2.py"
export LVS_CMD="singularity exec ${LVS_SIF} python3 ${LVS_PY}"

${LVS_CMD} \
    -t "${n_threads}" \
    -i "${infile}" # infile is the input bam aligned to chm13v2\
    -o "${outfile_prefix}" \
    -O bam \
    -C "${clft_file}" \
    -f "${ref_fasta}" # ref_fasta is hg38.noalt.decoy.bwa/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna, used in previous pipeline as reference \
    -fi "${ref_fasta}"\
    -s ilmn_pe \
    -a bwamem \
    --use_preset True \
    --forcerun

And the error messages from ValidateSamFile:

ERROR::MATE_CIGAR_STRING_INVALID_PRESENCE:Record 6, Read name ST-E00106:225:HHM22CCXX:7:2118:6238:47668, Mate CIGAR String (MC Attribute) present for a read whose mate is unmapped
ERROR::MATE_CIGAR_STRING_INVALID_PRESENCE:Record 47, Read name ST-E00106:225:HHM22CCXX:7:1210:27732:7866, Mate CIGAR String (MC Attribute) present for a read whose mate is unmapped
ERROR::MATE_CIGAR_STRING_INVALID_PRESENCE:Record 169, Read name ST-E00106:225:HHM22CCXX:7:1111:29630:48705, Mate CIGAR String (MC Attribute) present for a read whose mate is unmapped
ERROR::MATE_CIGAR_STRING_INVALID_PRESENCE:Record 172, Read name ST-E00106:225:HHM22CCXX:7:1211:12763:57284, Mate CIGAR String (MC Attribute) present for a read whose mate is unmapped
ERROR::MATE_CIGAR_STRING_INVALID_PRESENCE:Record 177, Read name ST-E00106:225:HHM22CCXX:7:1207:10439:33920, Mate CIGAR String (MC Attribute) present for a read whose mate is unmapped
ERROR::MATE_CIGAR_STRING_INVALID_PRESENCE:Record 415, Read name ST-E00106:225:HHM22CCXX:7:2218:24515:6267, Mate CIGAR String (MC Attribute) present for a read whose mate is unmapped
ERROR::MISMATCH_MATE_ALIGNMENT_START:Record 415, Read name ST-E00106:225:HHM22CCXX:7:2218:24515:6267, Mate alignment does not match alignment start of mate

... (more like this)

Maximum output of [100] errors reached.

It seems to me that no one has brought up that issue, but I did found a commit to remove MC:z tags for unmapped reads.

Is there anything I have missed?

Thanks in advance, Yunkai

milkschen commented 6 months ago

Hi Yunkai,

Thank you for reporting the issue. As you pointed out, I have made some recent changes to address the issue. There are other ongoing items I am working on, so I have not made a new release that includes the commit.

Seems like this issue is blocking your analysis at the BQSR step? If that's the case, building the software from the main branch might help.

yunkaig commented 6 months ago

Thanks for the quick reply! The issue occurs after the BQSR step was completed, but I will try the main branch to see if it solves the issue.

yunkaig commented 6 months ago

Hi @milkschen,

I have rerun lift over the version built from the main branch. Now instead of the previous error, ValidateSamFile presents me with these errors:

ERROR::MISMATCH_MATE_ALIGNMENT_START:Record 1, Read name ST-E00106:225:HHM22CCXX:7:1118:9557:28558, Mate alignment does not match alignment start of mate
ERROR::MISMATCH_FLAG_MATE_NEG_STRAND:Record 1, Read name ST-E00106:225:HHM22CCXX:7:1118:9557:28558, Mate negative strand flag does not match read negative strand flag of mate
ERROR::MISMATCH_FLAG_MATE_UNMAPPED:Record 1, Read name ST-E00106:225:HHM22CCXX:7:1118:9557:28558, Mate unmapped flag does not match read unmapped flag of mate
ERROR::MISMATCH_MATE_CIGAR_STRING:Record 1, Read name ST-E00106:225:HHM22CCXX:7:1118:9557:28558, Mate CIGAR string does not match CIGAR string of mate

Other reads also at least 3 out of these 4 errors. Again it halts after encountering 100 errors.

Not sure if it helps but here is the read pair where these errors came from:

ST-E00106:225:HHM22CCXX:7:1118:9557:28558       99      chr1    260895  0       150M    =       261280  535     ATCCCAGAACAGAATGCATCATGTGAAAAAGAATTTATGCTACAAATTACTATGGTTTGGATGTGGTTTGTCCCCGCAAAAACTCATGTTGAAATTTGACCCCCAATGTGGCAGTGTGGGGCGGTGGGGCCTAGTGGATGGTGTTTGGGT     ????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????5??????????5??????     XA:Z:chr6,-172119695,150M,1;chr19,+205336,150M,4;  MC:Z:78M1I70M   MD:Z:7G7A34G24A24T4C4T27C11     RF:Z:source     RG:Z:HHM22CCXX-7-1EBCC218       NM:i:8  LO:Z:L_L  UQ:i:42  AS:i:145        XS:i:145
ST-E00106:225:HHM22CCXX:7:1118:9557:28558       133     chr1    260895  0       *       =       260895  0       ACCATTTTCCCTCCACCTTCAGTGAGTTTTTACCTACATTTGTGTCTTAGTCCATCTTGTGCTTCTGTAAACAGAATACCTGAGGCTGGGTAATTTATAAGTAAAAAAGATTCATTTGGCTCACAATACTGGTAGCTGGAATGGCCGAC      ????????????????????????????????????????????????????????????????????????5?????????????????????????????????+?????????????????????555+?+5??5????5+55%5%      XA:Z:chr6,+172119311,68M1I80M,3;   MC:Z:150M       MD:Z:0c4a142    RF:Z:source     RG:Z:HHM22CCXX-7-1EBCC218       NM:i:3  LO:Z:UM_L       UQ:i:18 AS:i:135        XS:i:135

Many thanks