Status: Closed (koisland closed 7 months ago)
Samples with multiple mappings get removed when the contig is renamed in this section:
https://github.com/logsdon-lab/hgsvc3/blob/9ee3fcad086cc64262d0e2977003b4a8013102b8/workflow/rules/rename_ctgs.smk#L116-L120
For HG00513 and contig haplotype1-0000030:
(base) [koisland@sarlacc cens]$ grep "000030" HG00513_centromeric_regions.all.bed
haplotype1-0000030 7767928 10629506 97204658 chr14:9567392-13280157 - 2861578
haplotype1-0000030 7767928 10629506 97204658 chr14:9567392-13280157 + 2861578
haplotype1-0000030 97293 11270481 97204658 chr22:7989243-17375108 - 11173188
haplotype1-0000030 97293 11270481 97204658 chr22:7989243-17375108 + 11173188
haplotype1-0000030 601202 19397775 97204658 chr2:91797392-95576642 - 18796573
haplotype1-0000030 601202 19397775 97204658 chr2:91797392-95576642 + 18796573
When the legend for the oriented BED files is first created, it correctly lists all three mappings:
awk -v OFS="\t" '{print $0, FILENAME}' HG00513_centromeric_regions.fwd.bed | \
sed 's/_/\t/g' | \
sed 's/:/\t/g' | \
awk -v OFS="\t" '{print $1, $9"_"$5"_"$1}' | \
sort -k2,2 | grep 00030

haplotype1-0000030 HG00513_chr14_haplotype1-0000030
haplotype1-0000030 HG00513_chr22_haplotype1-0000030
haplotype1-0000030 HG00513_chr2_haplotype1-0000030
However, when the legend is loaded into an awk associative array keyed on the contig name, each key can hold only one value. Every later row overwrites the earlier ones, so only the last entry (chr2) survives. This results in the following names:

HG00513_chr2_haplotype1-0000030:7767929-10629506 2861578 833894731 2861578 2861579
HG00513_chr2_haplotype1-0000030:97294-11270481 11173188 836756358 11173188 11173189
HG00513_chr2_haplotype1-0000030:601203-19397775 18796573 847929596 18796573 18796574
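The overwrite can be reproduced in isolation. The sketch below is not the code from `rename_ctgs.smk`. It assumes only that the legend has two tab-separated columns (old contig name, new name), as in the three rows listed above. The function names `legend`, `last_name_for_contig`, `count_names_by_contig`, and `count_names_by_new_name` are hypothetical helpers for illustration.

```shell
# Hypothetical legend mirroring the three rows above: old name <TAB> new name.
legend() {
  printf '%s\t%s\n' \
    haplotype1-0000030 HG00513_chr14_haplotype1-0000030 \
    haplotype1-0000030 HG00513_chr22_haplotype1-0000030 \
    haplotype1-0000030 HG00513_chr2_haplotype1-0000030
}

# Keyed on the contig name alone: each assignment replaces the previous
# value, so the lookup returns whichever row was read last.
last_name_for_contig() {
  legend | awk -F'\t' '{ kv[$1] = $2 } END { print kv["haplotype1-0000030"] }'
}

# Number of distinct keys when keyed on the contig name.
count_names_by_contig() {
  legend | awk -F'\t' '{ kv[$1] = $2 } END { n = 0; for (k in kv) n++; print n }'
}

# Number of distinct keys when keyed on the (unique) new name instead.
count_names_by_new_name() {
  legend | awk -F'\t' '{ kv[$2] = $1 } END { n = 0; for (k in kv) n++; print n }'
}
```

Keying on the contig name collapses the three rows into a single entry, while a key that is unique per mapping keeps all three. Whether the workflow can use such a key depends on what the renaming step looks up, which this sketch does not cover.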