Closed Huanle closed 4 years ago
Hi, it's hard to anwser without seeing the bam. Can you please post a subsection of the bam overlapping the region.
what is ref_2244 in awk '/ref_2244/ ...
the name of the chromosome ? there is no such chromosome.
samtools view -H test.bam | grep ref_2244
or no information with this 'ref_2244'
samtools view test.bam | grep ref_2244
my bad. real reference Id is 'cc6m_2244_T7_ecorv'.
you cannot use $7 to filter for REF position with an insertion because there is a '.' for an insertion. See the 2nd row below:
5c68a3f2-711a-41ef-97a6-0f0f355f50ad 0 cc6m_2244_T7_ecorv 6 T , 38 N M
5c68a3f2-711a-41ef-97a6-0f0f355f50ad 0 cc6m_2244_T7_ecorv 7 G . . . I
5c68a3f2-711a-41ef-97a6-0f0f355f50ad 0 cc6m_2244_T7_ecorv 8 G ( 39 N M
I'm closing this one.
thanks a lot. this is indeed the reason.
Hi @lindenb , Sorry to bug you again. But after diving into the tsv file and compare it to the pileup file. I found the inconsistence below.
# Insertions - from pileup
awk '$2==40' test.bam.pileup| perl -ne 'chomp;@a = $_ =~ /\+\d+[AGCT]/g; print scalar(@a),"\n";'
23
# deletions from pileup
awk '$2==40' test.bam.pileup| perl -ne 'chomp;@a = $_ =~ /\-\d+[AGCT]/g; print scalar(@a),"\n";'
128
# Insertions from sam2tsv result
grep -P 'I$' test.bam.tsv -C 1 |awk '$7==40'|wc -l
30
#deletions from sam2tsv result
awk '$7==40 && $9 == "D"' test.bam.tsv | wc -l
5
Hope you can help me figure out the potential cause to what i have seen. Thanks a lot! test.bam.pileup.txt
again, it's hard to answer without the bam. Can you please attach the bam , with this line at 40 only please.
Hi @lindenb,
bam files are attached. They are the same as previously uploaded.
I don't get the same result. with your command line.
samtools mpileup ~/test.bam | awk '$2==40'| perl -ne 'chomp;@a = $_ =~ /\+\d+[AGCT]/g; print scalar(@a),"\n";'
[mpileup] 1 samples in 1 input files
0
the output of samtools mpileup test.bam
doesn't look at all like the file test.bam.pileup.txt provided.
so l looked at your BAM.
b654b8ee-46a8-4b86-8b24-0ea4ca6a2b02 0 cc6m_2244_T7_ecorv 35 60 1S5M1D15M(...)...
this read starts at 35 after one clipped base. There are 5 bases matching followed by one deletion. So
b654b8ee-46a8-4b86-8b24-0ea4ca6a2b02 0 cc6m_2244_T7_ecorv 0 T % 34 N S
b654b8ee-46a8-4b86-8b24-0ea4ca6a2b02 0 cc6m_2244_T7_ecorv 1 G % 35 N M
b654b8ee-46a8-4b86-8b24-0ea4ca6a2b02 0 cc6m_2244_T7_ecorv 2 G 0 36 N M
b654b8ee-46a8-4b86-8b24-0ea4ca6a2b02 0 cc6m_2244_T7_ecorv 3 C ( 37 N M
b654b8ee-46a8-4b86-8b24-0ea4ca6a2b02 0 cc6m_2244_T7_ecorv 4 T ) 38 N M
b654b8ee-46a8-4b86-8b24-0ea4ca6a2b02 0 cc6m_2244_T7_ecorv 5 G * 39 N M
b654b8ee-46a8-4b86-8b24-0ea4ca6a2b02 0 cc6m_2244_T7_ecorv . . . 40 N D
b654b8ee-46a8-4b86-8b24-0ea4ca6a2b02 0 cc6m_2244_T7_ecorv 6 G * 41 N M
b654b8ee-46a8-4b86-8b24-0ea4ca6a2b02 0 cc6m_2244_T7_ecorv 7 G + 42 N M
b654b8ee-46a8-4b86-8b24-0ea4ca6a2b02 0 cc6m_2244_T7_ecorv 8 A 5 43 N M
b654b8ee-46a8-4b86-8b24-0ea4ca6a2b02 0 cc6m_2244_T7_ecorv 9 A 4 44 N M
b654b8ee-46a8-4b86-8b24-0ea4ca6a2b02 0 cc6m_2244_T7_ecorv 10 G 3 45 N M
b654b8ee-46a8-4b86-8b24-0ea4ca6a2b02 0 cc6m_2244_T7_ecorv 11 T 4 46 N M
so, unless I'm wrong. As far as I can see, there is a deletion at 40 for this read.
The doc says http://www.htslib.org/doc/samtools-mpileup.1.html
If there is a deletion after this read base, text matching “-[0-9]+[ACGTNacgtn]+”: a “-” character followed by the deleted reference bases represented similarly
for the read above I can see the deletion at REF position 39 (so the deletion is AFTER 39...)
$ samtools mpileup -Q 0 ~/test2.bam | grep -w 40 -C 2
[mpileup] 1 samples in 1 input files
cc6m_2244_T7_ecorv 38 N 1 T )
cc6m_2244_T7_ecorv 39 N 1 G-1N *
cc6m_2244_T7_ecorv 40 N 1 * *
cc6m_2244_T7_ecorv 41 N 1 G *
cc6m_2244_T7_ecorv 42 N 1 G +
Hi @lindenb , I think i have mislead you. I am sorry about this. The number of deletions at 40, counted from the pileup format is 128 and the number of insertions is 23. While these numbers counted from sam2tsv result are different. I am wondering what has caused this discrepancy? Thanks a lot for your patience.
Hi @lindenb , Sorry to bug you again. But do you have any idea of the potential cause of this discrepancy?
no, sorry I don't have the time to work on this.
Hi There, I converted my bad file into a tsv file in order to known alignment at per-base level. From pileup bases (see below) I know that at the given position (ref:40) there are inserstions (100+). However, I can not find them from tsv file.
Can you explain why?
pileup bases: