ndreey / ghost-magnet

Molecular Bioinformatics BSc thesis project at University of Skövde
MIT License

CONCOCT: SAMtools duplicated header #67

Closed ndreey closed 1 year ago

ndreey commented 1 year ago

On both the original run and the re-run, some samples failed because samtools detected duplicate entries in the header of the .sam file.

On the re-run, samples 06, 07, and 090 failed.

Fri Apr 28 00:13:40 CEST 2023    SAMtools Engaged!
[W::sam_hdr_create] Duplicated sequence "scaffold_10_from_617420_to_617569_total_150" in file "06_map.sam"
[W::sam_hdr_create] Duplicated sequence "scaffold_17_from_94_to_243_total_150" in file "06_map.sam"
[W::sam_hdr_create] Duplicated sequence "scaffold_18_from_271759_to_271908_total_150" in file "06_map.sam"
[E::sam_hrecs_update_hashes] Duplicate entry "scaffold_10_from_617420_to_617569_total_150" in sam header
samtools view: failed to add PG line to the header

Fri Apr 28 00:53:10 CEST 2023    SAMtools Engaged!
[W::sam_hdr_create] Duplicated sequence "scaffold_9_from_601230_to_601379_total_150" in file "07_map.sam"
[E::sam_hrecs_update_hashes] Duplicate entry "scaffold_9_from_601230_to_601379_total_150" in sam header
samtools view: failed to add PG line to the header

Fri Apr 28 01:34:30 CEST 2023    SAMtools Engaged!
[W::sam_hdr_create] Duplicated sequence "scaffold_7_from_597494_to_597643_total_150" in file "090_map.sam"
[E::sam_hrecs_update_hashes] Duplicate entry "scaffold_7_from_597494_to_597643_total_150" in sam header
samtools view: failed to add PG line to the header
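The duplicated names can also be listed up front, before samtools stops on the first one. The following is a minimal sketch, assuming a tab-separated SAM file whose header lines all come before the alignment records; `dup_sq_names` is a hypothetical helper name, not part of samtools.

```shell
# dup_sq_names FILE.sam  (hypothetical helper)
# Prints every @SQ sequence name that occurs more than once in the header.
dup_sq_names() {
    awk -F'\t' '
        /^@SQ/ {
            # The sequence name is stored in the SN: field of an @SQ line.
            for (i = 2; i <= NF; i++)
                if ($i ~ /^SN:/) print substr($i, 4)
        }
        !/^@/ { exit }   # stop reading once the alignment records begin
    ' "$1" | sort | uniq -d
}
```

For example, `dup_sq_names 06_map.sam` should print one line per duplicated name, which would be expected to match the three names in the warnings above.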

These are the lines that cause the issue for sample 090:

$ cat 090_map.sam | grep "scaffold_7_from_597494_to_597643_total_150"
@SQ     SN:scaffold_7_from_597494_to_597643_total_150   LN:150
@SQ     SN:scaffold_7_from_597494_to_597643_total_150   LN:150

scaffold_7-1494 81      scaffold_7_from_597646_to_597795_total_150      12      42      135M    scaffold_7_from_597494_to_597643_total_150   5       0
AGGAGCCCAGCGAGGAGTACATTATCGTACCTCAGTACCCGCTCATGCGCACAGAGGGCTCTCAGGAACCTGTTCAATCCAAGGTCGAGGACGGGATCCGGGTATATGCAGGTGGCGAGGACGTCCGTGTCAAGG
GCGGGGG$GGGGGGGGG$CGGGGGGGGGGG=GGC=GCCJGGG=GGGC(CCGG==GGGGGJCJGGGGCGGGGGGGCGGJCCCJGCGJGJGJGGCJJCGJGGGCGJJGGJJJJCGJJJJJJJJJJ8JCGGGCGGG=G
AS:i:-2 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:17C117 YS:i:-2  YT:Z:DP

scaffold_7-1494 161     scaffold_7_from_597494_to_597643_total_150      5       42      38M     scaffold_7_from_597646_to_597795_total_150   12      0
CATCTCGGCCTAGGGCAACGTCCCCGCCTCCTCCCCCG
CGGG=GGGGGGGGGGGGJJG$CJJJGJGCGGGG$CGGJ
AS:i:-2 XN:i:0       XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:20C17      YS:i:-2 YT:Z:DP

scaffold_7-5188 81      scaffold_7_from_597668_to_597817_total_150      28      42      119M    scaffold_7_from_597494_to_597643_total_150   5       0
AGGGGTGGAATATCTCCACGGCAGGGAGCCCCCCATATGCCGTGGCGACCTGAAATCCGTAAGTCGCGATGCAAGGCCTGGCGGCACACTGCGCGTTGCAAACATAGGCCCATGCTTTA
CCCCGGGGCGGGGGCGG=GCC1CCCGGGGGGGGGGGGGGGG$GCG1CGGGC=GGCGJCJGJJCGGJCJCJCG=GJGGJGJJJJCJ8JGGJJGJJJJJJJJJJJJCCGJGJGGGGGGGGG
AS:i:-2 XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:41A77      YS:i:-4 YT:Z:DP

scaffold_7-5188 161     scaffold_7_from_597494_to_597643_total_150      5       42      107M    scaffold_7_from_597668_to_597817_total_150   28      0
AAACTTGAAGGATTCGACGAGGACGTATCGAATAGAATGATCTGGCTTGTCTTTCCTTGGCAAGAGAATGGTACTTTGAAGGACTTTATCGCTTCGGCTGACTCGGA
GGGGG$GGGJ$CJJJJ$JJCJJJ=CGJJG8JJ$JGJJJGCGGGGJJCJG1JCJGGJGCG8GGGGGJ(CGGCCGGGGGGG1GGCGGGCGG=GCGCGCCJC8CJC$GGG
AS:i:-4  XN:i:0  XM:i:2  XO:i:0  XG:i:0  NM:i:2  MD:Z:16T86G3    YS:i:-2 YT:Z:DP
ndreey commented 1 year ago

090

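To resolve this issue, I will remove every line that mentions the duplicated sequence instead of keeping one copy:

grep -v "scaffold_7_from_597494_to_597643_total_150" 090_map.sam > 090_fixed_map.sam

This drops both @SQ header lines as well as the reads aligned to that sequence and their mates. I can then rename the result to 090_map.sam. With the duplicates removed, I tested the conversion:

samtools view -@ 6 -b -S 090_map.sam > 090_map.bam

It ran without issues.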
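Because grep -v matches the name anywhere on a line, it would also drop lines for any other sequence whose name merely contains the same string. A field-exact variant is sketched below, assuming a tab-separated SAM file with the reference name in column 3 (RNAME) and the mate reference in column 7 (RNEXT); `drop_sequence` is a hypothetical helper name.

```shell
# drop_sequence NAME IN.sam > OUT.sam  (hypothetical helper)
# Removes the @SQ header lines for NAME plus every alignment record whose
# reference (column 3) or mate reference (column 7) is exactly NAME.
drop_sequence() {
    awk -F'\t' -v name="$1" '
        /^@SQ/ {
            # Skip the header line only when its SN: field is exactly NAME.
            for (i = 2; i <= NF; i++)
                if ($i == ("SN:" name)) next
        }
        /^@/ { print; next }                # keep all other header lines
        $3 == name || $7 == name { next }   # drop reads and their mates
        { print }
    ' "$2"
}
```

Usage would look like `drop_sequence scaffold_7_from_597494_to_597643_total_150 090_map.sam > 090_fixed_map.sam`.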

07

Here it seems that only the @SQ line is duplicated:

$ cat 07_map.sam | grep "scaffold_9_from_601230_to_601379_total_150"
@SQ     SN:scaffold_9_from_601230_to_601379_total_150   LN:150
@SQ     SN:scaffold_9_from_601230_to_601379_total_150   LN:150

scaffold_9-3296 97      scaffold_9_from_601050_to_601199_total_150      5       42      48M     scaffold_9_from_601230_to_601379_total_150   37      0
TGCAACCGTACTAATAAATACGACGAGCATACAAGGACAGGATGCTAC
CGC1GGCGGJGJCJGJJJGGCJJJ=JJJGJ=GJGGCJGGCGJJGGCC=
AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:48 YS:i:-6 YT:Z:DP

scaffold_9-3296 145     scaffold_9_from_601230_to_601379_total_150      37      42      110M    scaffold_9_from_601050_to_601199_total_150   5       0
AGCTAATGCAGGTTGCTTTTCAGCCCGTGCACATGTACCTGCCCGTCCGTATGCAGATTATGATGAGTGTCATAATCCCGTGTAGAGAGTAGGTCACAATAAGATATTGA
GCCGGCGJJJ$JJGGGGGCCGC=GCGC=C1GGGGGGGCGGGGC$GGGGJ==GJG$CGGCCC88J8GGJJ1GJJJJGJGGJCJJJGJJGCGCG$JJ=JGJJGGCGGG$GCG
AS:i:-6 XN:i:0  XM:i:3  XO:i:0  XG:i:0  NM:i:3  MD:Z:10C32T48C17        YS:i:0  YT:Z:DP

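I therefore used awk to drop the repeated @SQ line. The filter keeps the first occurrence of each line and removes later identical copies that match the sequence name:

awk '!seen[$0]++ || !/SN:scaffold_9_from_601230_to_601379_total_150/' 07_map.sam > 07_fixed_map.sam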
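The awk filter above targets one name and records every line of the file in its `seen` array. A more general variant is sketched here: it deduplicates all @SQ lines by their SN: field and only remembers header names. It assumes that keeping the first copy is acceptable, which this thread only establishes for cases where just the header line is repeated; `dedup_sq` is a hypothetical helper name.

```shell
# dedup_sq IN.sam > OUT.sam  (hypothetical helper)
# Keeps the first @SQ line for each sequence name and drops later repeats.
# All other lines pass through unchanged.
dedup_sq() {
    awk -F'\t' '
        /^@SQ/ {
            sn = ""
            for (i = 2; i <= NF; i++)
                if ($i ~ /^SN:/) sn = $i
            if (seen[sn]++) next   # already printed an @SQ with this name
        }
        { print }
    ' "$1"
}
```

Usage would look like `dedup_sq 07_map.sam > 07_fixed_map.sam`.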

06

Multiple duplicates...

scaffold_10_from_617420_to_617569_total_150

Only the @SQ line is duplicated; removed using awk.

scaffold_17_from_94_to_243_total_150

Same as above; removed using awk.

scaffold_18_from_271759_to_271908_total_150

As with sample 090, I will remove all lines that match this sequence name.


ndreey commented 1 year ago

When running CONCOCT, I ran into errors for samples 06 and 090; sample 07 was successful.

Errors in BED line 'scaffold_7_from_597494_to_597643_total_150  0       150     scaffold_7_from_597494_to_597643_total_150.concoct_part_0'

All of them occur in the coverage table step of concoct_run.sh:

# Generates the coverage table (expansions quoted so odd prefixes cannot split)
concoct_coverage_table.py "${hc_prefix}_contigs_10K.bed" \
    "${hc_prefix}_map_sorted.bam" > "${hc_prefix}_coverage_table.tsv"
echo "$(date)    Coverage file is generated"

The cause seems to be a duplicated entry in the BED file:

$ cat 090_contigs_10K.bed | grep "scaffold_7_from_597494_to_597643_total_150"
scaffold_7_from_597494_to_597643_total_150      0       150     scaffold_7_from_597494_to_597643_total_150.concoct_part_0
scaffold_7_from_597494_to_597643_total_150      0       150     scaffold_7_from_597494_to_597643_total_150.concoct_part_0

I will retry after removing both lines from the BED file.

$ grep -v "scaffold_7_from_597494_to_597643_total_150" 090_contigs_10K.bed > 090_10k.bed

$ concoct_coverage_table.py 090_10k.bed 090_map_sorted.bam > 090_cov_tab.tsv

That seems to have done the trick.
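The same check and removal can be applied to any sample before the coverage table step. This is a minimal sketch, assuming a tab-separated BED file with the contig name in column 1 and the sub-contig ID in column 4, as in the lines shown above; `dup_bed_ids` and `drop_bed_contig` are hypothetical helper names.

```shell
# dup_bed_ids FILE.bed  (hypothetical helper)
# Prints every sub-contig ID (column 4) that occurs more than once.
dup_bed_ids() {
    cut -f4 "$1" | sort | uniq -d
}

# drop_bed_contig NAME IN.bed > OUT.bed  (hypothetical helper)
# Removes every BED line whose contig name (column 1) is exactly NAME.
drop_bed_contig() {
    awk -F'\t' -v name="$1" '$1 != name' "$2"
}
```

For example, `dup_bed_ids 090_contigs_10K.bed` lists the offending IDs, and `drop_bed_contig scaffold_7_from_597494_to_597643_total_150 090_contigs_10K.bed > 090_10k.bed` mirrors the grep -v step with an exact match on the contig name.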