parklab / xTea

Comprehensive TE insertion identification with WGS/WES data from multiple sequencing technologies

User request CHM13 libs #50

Closed simoncchu closed 1 year ago

simoncchu commented 2 years ago

Question from https://github.com/parklab/xTea/issues/20

Hi, I tried using >5900 bp as the cutoff for full-length L1. I ran hg38 first to see whether I could reproduce the provided hg38 rep_lib_annotation data. The result I got was much larger than the provided annotation file. For example, the hg38_FL_L1_flanks.fa file I generated is 53 MB (using -e 100), while hg38_FL_L1_flanks_3k.fa in the provided rep_lib_annotation is 2 MB. My code is below; any idea what is incorrect? The hg38 reference genome and RepeatMasker output file are both from UCSC.
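File size alone does not show whether the extra megabytes come from more records or from longer sequences per record. A minimal sketch for comparing the two FASTA files by record count and total bases follows; the helper name `fasta_stats` is hypothetical and is not part of xTea.

```python
def fasta_stats(path):
    """Count records and total sequence bases in a FASTA file."""
    n_records = 0
    n_bases = 0
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                # Each header line starts a new record.
                n_records += 1
            else:
                # Sequence line: count characters, ignoring whitespace and newline.
                n_bases += len(line.strip())
    return n_records, n_bases
```

Running `fasta_stats("hg38_FL_L1_flanks.fa")` and `fasta_stats("hg38_FL_L1_flanks_3k.fa")` side by side would show whether the generated file has far more records than the provided one, which would point at the length filter rather than at the flank setting.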


#########
grep "LINE1" hg38.fa.out > hg38.fa_L1.out
cat hg38.fa_L1.out | while read line
do
eval "$(echo "${line}" | awk '{printf("var_9=%s;var_12=%s;var_13=%s;var_14=%s;",$9,$12,$13,$14)}')"
if [ $var_9 == "C" ];then
i_length=$(($var_13 - $var_14))
else
i_length=$(($var_13 - $var_12))
fi
if [ $i_length -gt 5900 ];then
echo "$line"
fi
done >hg38.fa_L1_full_length.out ### this is to select out the LINE1 >5900bp

python x_TEA_main.py -P -K -p ./ -r hg38.fa -a hg38.fa_L1_full_length.out -o hg38.fa_L1_full_length_with_flank_e100.fa -e 100
#########
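For comparison, the same length filter can be sketched in Python, which avoids `eval` and makes the column handling explicit. This assumes the standard whitespace-separated RepeatMasker `.out` layout (strand in column 9, class/family in column 11, repeat coordinates in columns 12 to 14). The function names and the `LINE/L1` family label are assumptions for illustration, not xTea's own preprocessing.

```python
def consensus_span(fields):
    """Span covered on the repeat consensus for one split .out record."""
    strand = fields[8]
    rep_end = int(fields[12])
    # On the complement strand ("C") the repeat start sits in column 14;
    # on the forward strand it sits in column 12.
    rep_start = int(fields[13] if strand == "C" else fields[11])
    return rep_end - rep_start


def filter_full_length(lines, family="LINE/L1", min_span=5900):
    """Keep records of the given family whose consensus span exceeds min_span."""
    kept = []
    for line in lines:
        fields = line.split()
        # Skip blank lines, the header lines, and records from other families.
        if len(fields) < 15 or not fields[0].isdigit() or fields[10] != family:
            continue
        if consensus_span(fields) > min_span:
            kept.append(line.rstrip("\n"))
    return kept
```

A typical use would be `kept = filter_full_length(open("hg38.fa.out"))`, then writing `kept` to the file passed to `-a`. Printing `len(kept)` gives a quick check on how many elements pass the cutoff before any flanks are extracted.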


> And is it reasonable to set cutoff for full-length Alu, SVA, HERV as 250bp, 1900bp, 8900bp?

> It would be super helpful if you could kindly add chm13 into the rep_lib_annotation data. Thank you!

anderswe commented 1 year ago

Also interested in CHM13 in rep_lib_annotation! Thank you!

zhuxf-lab commented 1 year ago

Any chance the CHM13 lib will be out soon? We got stuck in the lib preparation. Thank you!

mikecuoco commented 1 year ago

Hi @simoncchu have you had any luck with generating the libraries for the T2T-CHM13v2.0 reference? I tried to follow your instructions, but it looks like you have additional custom files for each human TE type, so I'm worried the custom implementation will be suboptimal.

UCSC recently published the build and RepeatMasker output on the genome browser FTP server here. Let me know if I can do anything to help!

simoncchu commented 1 year ago

Added CHM13v2 support. Please give it a try.