lindenb / jvarkit

Java utilities for Bioinformatics
https://jvarkit.readthedocs.io/
Other
482 stars 133 forks source link

sam2tsv lists incorrect reference sequence & positions #190

Closed dchakro closed 3 years ago

dchakro commented 3 years ago

Hi,

Duplicate of https://www.biostars.org/p/9493154/

Can anyone help me resolve an issue I'm having with sam2tsv. It is a nifty piece of software but I have been encountering issues with it regarding the numbering of nucleotides it shows for the reference sequence.

Here's what sam2tsv tells me: Output of sam2tsv

The nucleotide string marked in red CTGGCCGAGCTAG is the read reporting the mutation T>A (line #469). But the reference sequence listed by sam2tsv (green box) doesn't match the read sequence at all. In fact it is correct sequence from the reference fasta, but it is right-shifted by 16 bases. With other reference files, this number is different, e.g. 20.

In fact, if I search the sequence ATGGAGACCCGCT in my reference sequence it spans 356-368. In contrast sam2tsv lists these residues in the range 339-352 (as seen in the green box). Off by exactly 16 bp.

searching within the reference

To summarize: 1) The position listed in the green box (reference), corresponds to the sequence in the red box (read). ✅ 2) The reference sequence listed in the green box, is off by 16 bases. ❌

Information: PacBio HiFi CCS reads aligned using pbmm2 to custom reference (cDNA).

Files:

Reference.fa.

relevant sam2tsv output

samfile with reads

Command used:

java -jar '~/Git/Others_Cloned/jvarkit/dist/sam2tsv.jar' -R ../../../reference/pBG-ERBB2/ERBB2.fa 1.ERBB2_Library.bam
lindenb commented 3 years ago

hi, can you please also send a snapshot of the sam in that region please.

dchakro commented 3 years ago

Here is the sam corresponding to this read. As these are pacbio reads spanning the entire cDNA, on using samtools view <region> I still get the entire file.

I can give more reads as "control", if required.

lindenb commented 3 years ago

what version are you using ? (java -jar ${JVARKIT_DIST}/sam2tsv.jar --version) I don't get the same result as you.

m64145_210904_052513/66383/ccs  0   60  ERBB2   574 T   ~   458 T   EQ
m64145_210904_052513/66383/ccs  0   60  ERBB2   575 C   ~   459 C   EQ
m64145_210904_052513/66383/ccs  0   60  ERBB2   576 T   ~   460 T   EQ
m64145_210904_052513/66383/ccs  0   60  ERBB2   577 T   ~   461 T   EQ
m64145_210904_052513/66383/ccs  0   60  ERBB2   578 G   ~   462 G   EQ
m64145_210904_052513/66383/ccs  0   60  ERBB2   579 A   ~   463 A   EQ
m64145_210904_052513/66383/ccs  0   60  ERBB2   580 T   ~   464 T   EQ
m64145_210904_052513/66383/ccs  0   60  ERBB2   581 C   ~   465 C   EQ
m64145_210904_052513/66383/ccs  0   60  ERBB2   582 C   ~   466 C   EQ
m64145_210904_052513/66383/ccs  0   60  ERBB2   583 A   ~   467 A   EQ
m64145_210904_052513/66383/ccs  0   60  ERBB2   584 G   ~   468 G   EQ
dchakro commented 3 years ago

That is fascinating ! Those are the exact numbers it should output! For me the version says It says b4af26853 which seems to be the hash for the latest commit, should I be using the 2021.07.01 release instead ?

What version are you using ? It seems that version does it right 👍🏼

JRE: openjdk version 1.8.0_292 on Mac OS Catalina 10.15.7

dchakro commented 3 years ago

JRE: openjdk version 1.8.0_292 on Mac OS Catalina 10.15.7 Looks like I had an ancient version of java. As stated on the manual page for sam2tsv, I installed openjdk version "11.0.12" 2021-07-20

However, the results are still incorrect. results with jdk 11

I'm working with b4af26853 here. Can you let me know the version you are using @lindenb ?

lindenb commented 3 years ago

I'm using 1b3e21985b3afdd9d0a28545c29b3e5aacf53d7b (dev branch) but I think there is no major difference about the two versions for this tool.

dchakro commented 3 years ago

I'm using 1b3e219 (dev branch) but I think there is no major difference about the two versions for this tool.

Thank you Pierre! Out of curiosity. what version of java are you running ?

lindenb commented 3 years ago

java 1.8 but it's unrelated to the problem.

lindenb commented 3 years ago

I pushed a new release, can you please try with that one : https://github.com/lindenb/jvarkit/releases/tag/2021.10.13

dchakro commented 3 years ago

Thank You!.. Trying it right away

dchakro commented 3 years ago

Pierre, it is working now (and works for my other reference sequences as well)! Steps I took to make it work.

Curiously I tried with the "old" reference-dictionary-index triad... And I get the same incorrect results. Re-created the dictionary with picard and now it works exactly as expected!

Concluded root cause: The dictionary built (with picard) using the older version of java was the culprit.

We can close this issue now. I will close the entry on biostars.

Thank you for this nifty piece of software !!