logsdon-lab / CenMAP

Centromere mapping and annotation pipeline
MIT License

Renaming contigs removes multiple mappings #18

Closed koisland closed 7 months ago

koisland commented 8 months ago

Contigs with multiple mappings lose all but one of them when they are renamed in this section:

https://github.com/logsdon-lab/hgsvc3/blob/9ee3fcad086cc64262d0e2977003b4a8013102b8/workflow/rules/rename_ctgs.smk#L116-L120

For example, in sample HG00513, contig haplotype1-0000030 maps to three chromosomes:

(base) [koisland@sarlacc cens]$ grep "000030" HG00513_centromeric_regions.all.bed
haplotype1-0000030      7767928 10629506        97204658        chr14:9567392-13280157  -       2861578
haplotype1-0000030      7767928 10629506        97204658        chr14:9567392-13280157  +       2861578
haplotype1-0000030      97293   11270481        97204658        chr22:7989243-17375108  -       11173188
haplotype1-0000030      97293   11270481        97204658        chr22:7989243-17375108  +       11173188
haplotype1-0000030      601202  19397775        97204658        chr2:91797392-95576642  -       18796573
haplotype1-0000030      601202  19397775        97204658        chr2:91797392-95576642  +       18796573

When the legend for the oriented BED files is created, it initially lists all three mappings correctly.

awk -v OFS="\t" '{print $0, FILENAME}' HG00513_centromeric_regions.fwd.bed | \
sed 's/_/\t/g' | \
sed 's/:/\t/g' | \
awk -v OFS="\t" '{print $1, $9"_"$5"_"$1}' | \
sort -k2,2 | grep 00030
haplotype1-0000030      HG00513_chr14_haplotype1-0000030
haplotype1-0000030      HG00513_chr22_haplotype1-0000030
haplotype1-0000030      HG00513_chr2_haplotype1-0000030

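However, when the legend is loaded into an awk associative array keyed on the old contig name, each key can hold only one value, so every later entry overwrites the earlier one and only the last element (chr2) is kept. This results in the following names, where all three regions are labeled chr2.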

HG00513_chr2_haplotype1-0000030:7767929-10629506        2861578 833894731        2861578 2861579
HG00513_chr2_haplotype1-0000030:97294-11270481  11173188        836756358        11173188        11173189
HG00513_chr2_haplotype1-0000030:601203-19397775 18796573        847929596        18796573        18796574
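
The overwrite can be reproduced in isolation with a toy legend built from the three rows above. The second half is only a sketch of one possible workaround, not necessarily the fix that was applied in the pipeline: it keys the array on the contig name plus its 1-based region instead of the contig name alone. The variable names (`legend`, `bed`, `overwritten`, `renamed`) and the trimmed five-column BED are assumptions made for illustration.

```shell
# Toy legend (old name, new name), copied from the three rows listed above.
legend=$(printf '%s\t%s\n' \
  haplotype1-0000030 HG00513_chr14_haplotype1-0000030 \
  haplotype1-0000030 HG00513_chr22_haplotype1-0000030 \
  haplotype1-0000030 HG00513_chr2_haplotype1-0000030)

# Keyed on the contig name alone: each assignment replaces the previous one,
# so only the last row (chr2) is left in the array.
overwritten=$(printf '%s\n' "$legend" |
  awk -F'\t' '{ names[$1] = $2 } END { for (k in names) print names[k] }')
echo "$overwritten"

# Hypothetical alternative: key on contig name plus region so that every
# mapping gets its own entry. Columns here are the first five of the BED
# shown above (contig, start, end, contig length, reference region).
bed=$(printf '%s\t%s\t%s\t%s\t%s\n' \
  haplotype1-0000030 7767928 10629506 97204658 chr14:9567392-13280157 \
  haplotype1-0000030 97293 11270481 97204658 chr22:7989243-17375108 \
  haplotype1-0000030 601202 19397775 97204658 chr2:91797392-95576642)

renamed=$(printf '%s\n' "$bed" |
  awk -F'\t' -v OFS='\t' -v sample=HG00513 '
    {
      split($5, ref, ":")                # ref[1] is the chromosome, e.g. chr14
      key = $1 ":" ($2 + 1) "-" $3       # BED start is 0-based, region names are 1-based
      names[key] = sample "_" ref[1] "_" key
      order[NR] = key                    # remember input order for printing
    }
    END { for (i = 1; i <= NR; i++) print order[i], names[order[i]] }
  ')
echo "$renamed"
```

With the composite key, the three regions of haplotype1-0000030 keep their own chromosome labels (chr14, chr22, chr2) instead of all collapsing to chr2.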