haowenz / chromap

Fast alignment and preprocessing of chromatin profiles
https://haowenz.github.io/chromap/
MIT License

Incorrect TLEN information in Hi-C SAM output #139

Open slbai01 opened 1 year ago

slbai01 commented 1 year ago

I am running chromap 0.2.5-r473 on a highly repetitive genome, but the TLEN values in the SAM file look wrong.

Since only the single best match is output by default, I suspect that an error in handling multiple matches means the alignment from which TLEN was computed is not the one that gets written out.

Below are the records for several read pairs.

SRR12034698.8   115     chr1D   165570658       20      150M    =       164309919       -15705  GAATATTTTTTCTGACCATACATGCTCGGTCCGCCGAAGTTCTACGAGGGTAGCACTGTCCACTCGGACGATCGCCCAAATCATTACCTGAAGTCATCTTCAGGACTGCAAAAGGGTGAAAACGACACTCCTCTACGGATACACTTGGCA  -7A--FF-F<7-<F7AAAF7A-<A<F---<A-AF7FFF<-A--AAJJFJJJJJFFFJFJ<A-7AAAJJJJ7<JJA7AFJFFJF7FA<-JJJJJA<A7FA-AJJJ<FJFJJJJFJJJJJ<JAAFA-JJJF-7F<FAFJF-JFJJFJFFFAA  NM:i:1  MD:Z:39A110
SRR12034698.8   179     chr1D   164309919       20      48M     =       165570658       -15705  TTCTCGATGTGATCAACAGGTTGATNNAATGGNTGGANNNNCTNAGNG        7-A7--777---<FF7<7<J<<7JJ##JJJAJ#FAJF####7-#-)#A        NM:i:1  MD:Z:A47
SRR12034698.15  115     chr2B   457324039       15      150M    =       468582229       -14152  AAGTCTCAATCTGGATACATATTGAACCTGGGAGCAATTAGCTAGAGTAGCTCCGTGTAGAGCATTGTAGACATAGAATATTTGCAAAATGCATACGGCTCTGAATGTGGCAGACCCGTTGACTAAACTTCTCTCACGAGCAAAACATGA  JJA7FJJA<-A<-<JJJFJFFJJJA--<FJAF7F<JJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJAJJJJFJJJJJJJJJJJJJJJJJJFFJJFJJJJJJFFFAA  NM:i:1  MD:Z:26A123
SRR12034698.15  179     chr2B   468582229       15      104M    =       457324039       -14152  TTGGGACACTTCTATTTATATGCATNNAGGTANCGACNCNNTCNAGNAANCCAAATTAGATGTGCTTCAAAGTCAACTTGACAAGTTCAAGATGAAGGACGGTG        AJFF<-JF<JF7FFAFA<<--JF<-##<-F<<#)<7-#J##AA#JA#-F#JJJJJJJJJJJFAFFJJJJJJJJJJJJJJJJJJJJJF-JFJJJJJJAJFFJJJJ        NM:i:2  MD:Z:G2T100
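
For readers following along, here is a minimal sketch of how the records above can be checked by hand. It decodes the FLAG bits as defined in the SAM specification and compares the distance between POS and PNEXT with the reported TLEN. The helper name `decode_flag` and the `pairs` list are made up for this illustration; the numbers are copied from the first read of each pair above.

```python
# Sketch: decode SAM FLAG bits and compare the coordinate distance with TLEN.
# Bit meanings follow the SAM specification; helper names are illustrative only.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

def decode_flag(flag):
    """Return the names of all FLAG bits that are set."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

# (FLAG, POS, PNEXT, reported TLEN) for the first read of each pair
pairs = [
    (115, 165570658, 164309919, -15705),
    (115, 457324039, 468582229, -14152),
]

for flag, pos, pnext, tlen in pairs:
    # Distance between the two leftmost mapped positions on the same chromosome
    distance = abs(pnext - pos)
    print(decode_flag(flag), "distance:", distance, "|TLEN|:", abs(tlen))
```

In both pairs the two reads map more than a megabase apart, while the reported TLEN is under 16 kb, which is the mismatch this issue is about.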

Command line:

chromap -i -k 27 -w 14 -r $contigsFasta -o contigs.index
chromap --preset hic -r $contigsFasta -x $contigsChromapIndex -1 $r1Reads -2 $r2Reads --SAM -o aligned.$library_name.sam -t $thread

haowenz commented 1 year ago

Is there a reason you changed the default k and w values for the index?

For the TLEN issue, @mourisl can you take a look when you get time? Thanks!

slbai01 commented 1 year ago

The genome is too large, and index construction cannot complete with the default parameters. Issue #9 describes this problem.

mourisl commented 1 year ago

Hi @slbai01, sorry for the delayed reply. The TLEN issue could be due to a data type overflow. I have pushed an update to the li_dev5 branch; could you please check out that branch and give it a try? I think that in the split-alignment mode for Hi-C data, the sign of TLEN does not make much sense, and the length may be off by the read length depending on the strand. But at least the value of TLEN is now reasonable.
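
The overflow explanation can be checked against the records quoted earlier in the thread. The sketch below is only a consistency check: the formula `PNEXT - POS - read_len` and the 16-bit width are assumptions inferred from the numbers in this issue, not taken from the chromap source, and `wrap_int16` is a helper made up for the illustration.

```python
# Sketch: test whether the reported TLEN values match a signed 16-bit wrap.
# ASSUMPTION: TLEN was derived as PNEXT - POS - read_len and then stored in a
# 16-bit signed integer. This is inferred from the records, not from the code.

def wrap_int16(value: int) -> int:
    """Truncate an integer to a signed 16-bit value (two's complement)."""
    value &= 0xFFFF
    return value - 0x10000 if value >= 0x8000 else value

# (read name, POS, PNEXT, reported TLEN) for the first read of each pair
records = [
    ("SRR12034698.8", 165570658, 164309919, -15705),
    ("SRR12034698.15", 457324039, 468582229, -14152),
]
READ_LEN = 150  # length of the 150M read in each pair

for name, pos, pnext, reported in records:
    full = pnext - pos - READ_LEN   # untruncated value under the assumption
    wrapped = wrap_int16(full)      # what a 16-bit signed field would hold
    print(name, "full:", full, "wrapped:", wrapped, "reported:", reported)
```

Under these assumptions the wrapped value equals the reported TLEN for both pairs (-15705 and -14152), which fits a narrow integer type overflowing on distances of more than a megabase.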