Closed ndreey closed 1 year ago
To resolve this issue, I will remove all of the dupes instead of keeping one of them.
grep -v "scaffold_7_from_597494_to_597643_total_150" 090_map.sam > 090_fixed_map.sam
I can then rename it to 090_map.sam.
With the dupes removed, I tested
samtools view -@ 6 -b -S 090_map.sam > 090_map.bam
and it ran without issues.
Here it seems that only the @SQ line is duplicated.
cat 07_map.sam | grep "scaffold_9_from_601230_to_601379_total_150"
@SQ SN:scaffold_9_from_601230_to_601379_total_150 LN:150
@SQ SN:scaffold_9_from_601230_to_601379_total_150 LN:150
scaffold_9-3296 97 scaffold_9_from_601050_to_601199_total_150 5 42 48M scaffold_9_from_601230_to_601379_total_150 37 0 TGCAACCGTACTAATAAATACGACGAGCATACAAGGACAGGATGCTAC CGC1GGCGGJGJCJGJJJGGCJJJ=JJJGJ=GJGGCJGGCGJJGGCC= AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:48 YS:i:-6 YT:Z:DP
scaffold_9-3296 145 scaffold_9_from_601230_to_601379_total_150 37 42 110M scaffold_9_from_601050_to_601199_total_150 5 0 AGCTAATGCAGGTTGCTTTTCAGCCCGTGCACATGTACCTGCCCGTCCGTATGCAGATTATGATGAGTGTCATAATCCCGTGTAGAGAGTAGGTCACAATAAGATATTGA GCCGGCGJJJ$JJGGGGGCCGC=GCGC=C1GGGGGGGCGGGGC$GGGGJ==GJG$CGGCCC88J8GGJJ1GJJJJGJGGJCJJJGJJGCGCG$JJ=JGJJGGCGGG$GCG AS:i:-6 XN:i:0 XM:i:3 XO:i:0 XG:i:0 NM:i:3 MD:Z:10C32T48C17 YS:i:0 YT:Z:DP
I therefore used awk to drop the duplicated @SQ line, keeping only its first occurrence.
awk '!seen[$0]++ || !/SN:scaffold_9_from_601230_to_601379_total_150/' 07_map.sam > 07_fixed_map.sam
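To illustrate what this awk filter does, here is a minimal sketch on a toy SAM file. The contig and read names are made up for the example. A line is printed if that exact line has not been seen before, or if it does not match the SN: pattern, so the first copy of the @SQ line is kept and later identical copies are dropped.

```shell
#!/usr/bin/env bash
# Toy SAM file (hypothetical names): contig_b has its @SQ line twice.
toy_sam="$(mktemp)"
printf '@SQ\tSN:contig_a\tLN:150\n'  > "$toy_sam"
printf '@SQ\tSN:contig_b\tLN:150\n' >> "$toy_sam"
printf '@SQ\tSN:contig_b\tLN:150\n' >> "$toy_sam"
printf 'read_1\t0\tcontig_b\t5\t42\t4M\t*\t0\t0\tACGT\tJJJJ\n' >> "$toy_sam"

# Same filter as above, with the toy contig name substituted.
toy_fixed="$(mktemp)"
awk '!seen[$0]++ || !/SN:contig_b/' "$toy_sam" > "$toy_fixed"

# Count the remaining @SQ lines for contig_b and the total number of lines.
sq_left="$(grep -c 'SN:contig_b' "$toy_fixed")"
lines_left="$(grep -c '' "$toy_fixed")"
echo "contig_b @SQ lines: $sq_left, total lines: $lines_left"
```

The alignment line is untouched because it does not contain the SN: prefix.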
Multiple duplicates...
Only the @SQ line is duplicated; removed using awk.
Same as before, using awk.
Like the 090 sample, I will therefore remove all dupes.
grep -v "scaffold_18_from_271759_to_271908_total_150" 06_map.sam > 06_no_dupe.sam
awk '!seen[$0]++ || !/SN:scaffold_10_from_617420_to_617569_total_150/' 06_no_dupe.sam > 06_no_dupe2.sam
awk '!seen[$0]++ || !/SN:scaffold_17_from_94_to_243_total_150/' 06_no_dupe2.sam > 06_no_dupe3.sam
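Each awk call above targets one scaffold. When there are many, a single pass that drops any repeated @SQ line might be simpler. This is only a sketch: it assumes the duplicated header lines are byte-identical, and the function name and toy file are invented for the example. It does not replace the grep -v step, which also removes alignment lines.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print all non-@SQ lines unchanged,
# and print each distinct @SQ line only the first time it appears.
dedupe_sq_lines() {
  awk '!/^@SQ/ || !seen[$0]++' "$1"
}

# Toy SAM file with two duplicated @SQ lines and a repeated alignment line.
demo_sam="$(mktemp)"
printf '@SQ\tSN:contig_x\tLN:150\n'  > "$demo_sam"
printf '@SQ\tSN:contig_y\tLN:150\n' >> "$demo_sam"
printf '@SQ\tSN:contig_x\tLN:150\n' >> "$demo_sam"
printf '@SQ\tSN:contig_y\tLN:150\n' >> "$demo_sam"
printf 'read_1\t0\tcontig_x\t1\t42\t4M\t*\t0\t0\tACGT\tJJJJ\n' >> "$demo_sam"
printf 'read_1\t0\tcontig_x\t1\t42\t4M\t*\t0\t0\tACGT\tJJJJ\n' >> "$demo_sam"

deduped="$(dedupe_sq_lines "$demo_sam")"
header_count="$(printf '%s\n' "$deduped" | grep -c '^@SQ')"
read_count="$(printf '%s\n' "$deduped" | grep -c '^read_1')"
echo "@SQ lines: $header_count, alignment lines: $read_count"
```

Restricting the check to lines starting with @SQ leaves alignment records alone, even when two of them happen to be identical.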
When running CONCOCT, I ran into errors for samples 06 and 090; sample 07 was successful.
Errors in BED line 'scaffold_7_from_597494_to_597643_total_150 0 150 scaffold_7_from_597494_to_597643_total_150.concoct_part_0'
The errors all occur in the coverage table step of concoct_run.sh:
# Generates the coverage table
concoct_coverage_table.py ${hc_prefix}_contigs_10K.bed \
${hc_prefix}_map_sorted.bam > ${hc_prefix}_coverage_table.tsv
echo "$(date) Coverage file is generated"
The issue seems to be:
cat 090_contigs_10K.bed | grep "scaffold_7_from_597494_to_597643_total_150"
scaffold_7_from_597494_to_597643_total_150 0 150 scaffold_7_from_597494_to_597643_total_150.concoct_part_0
scaffold_7_from_597494_to_597643_total_150 0 150 scaffold_7_from_597494_to_597643_total_150.concoct_part_0
I will retry, this time removing the two duplicate lines from the BED file first.
grep -v "scaffold_7_from_597494_to_597643_total_150" 090_contigs_10K.bed > 090_10k.bed
concoct_coverage_table.py 090_10k.bed 090_map_sorted.bam > 090_cov_tab.tsv
Seems to have done the trick
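Before running the coverage step on other samples, the BED file could be checked for repeated entries. The sketch below assumes the sub-contig name is in the fourth column, as in the BED lines shown above. The function name and toy data are invented for the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every name in column 4 that occurs more than once.
list_duplicate_bed_names() {
  awk '{ print $4 }' "$1" | sort | uniq -d
}

# Toy BED file in which one entry is repeated.
toy_bed="$(mktemp)"
printf 'contig_a\t0\t150\tcontig_a.concoct_part_0\n'  > "$toy_bed"
printf 'contig_b\t0\t150\tcontig_b.concoct_part_0\n' >> "$toy_bed"
printf 'contig_b\t0\t150\tcontig_b.concoct_part_0\n' >> "$toy_bed"

dup_bed_names="$(list_duplicate_bed_names "$toy_bed")"
echo "Duplicated BED names: $dup_bed_names"
```

An empty result would suggest there are no repeated names in the BED file.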
Both on the original run and on the re-run there were samples which failed because samtools detected duplicates in the header of the .sam file.
On the re-run, samples 06, 07 and 090 failed.
These are the lines that are causing the issue for 090.
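A quick way to find such lines in other samples is to list every reference name that appears in more than one @SQ header line. This is a sketch under the assumption that header fields are tab-separated, as the SAM format specifies. The function name and toy file are made up here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the SN: value from each @SQ line
# and print the names that occur more than once.
list_duplicate_sq_names() {
  awk -F'\t' '/^@SQ/ { for (i = 2; i <= NF; i++) if ($i ~ /^SN:/) print substr($i, 4) }' "$1" \
    | sort | uniq -d
}

# Toy SAM header in which contig_b is declared twice.
hdr_sam="$(mktemp)"
printf '@HD\tVN:1.0\tSO:unsorted\n'  > "$hdr_sam"
printf '@SQ\tSN:contig_a\tLN:150\n' >> "$hdr_sam"
printf '@SQ\tSN:contig_b\tLN:150\n' >> "$hdr_sam"
printf '@SQ\tSN:contig_b\tLN:150\n' >> "$hdr_sam"

dup_sq_names="$(list_duplicate_sq_names "$hdr_sam")"
echo "Duplicated @SQ names: $dup_sq_names"
```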