vaquerizaslab / Chang_et_al_Zebrafish_TEs

10 stars 4 forks source link

A bug in <3.make_TE_classification.sh> #2

Closed Taoo777 closed 1 year ago

Taoo777 commented 1 year ago

Hello! Thanks for this good workflow. I thought one input file (TE_GTF) danRer11.TEtrans_uID.gtf in 3.make_TE_classification.sh may be incorrect (shown below).

TE_GTF="./data/**danRer11.TEtrans_uID.gtf**"
time python ./scripts/clean_OCTFTA_output.py \
  --te_gtf ./data/**danRer11.TEtrans_uID.gtf** \
  --input ./data/danRer11.nonalt.fa.out.elem_sorted.csv \
  --te_length ./data/tes.lengths \
  --te_length_cutoff 2 \
  --output ./data/danRer11.TEtrans_uID.dfrag.gtf

According to the workflow, TE_GTF: danRer11.TEtrans_uID.gtf was generated in 2.make_gene_and_TE_counts.sh, as described in the following code in <2.make_gene_and_TE_counts.sh>:

rmsk2bed < ./data/danRer11.nonalt.fa.out > ./data/danRer11.nonalt.fa.out.bed
# Convert to GTF format
time python ./scripts/RM_bed2GTF.py -i ./data/danRer11.nonalt.fa.out.bed -o ./data/danRer11.nonalt.fa.out.gtf
# Adapt GTF to TEtranscript requested format
# Use ./data/danRer11_classes_wredundant.txt file to solve some TE superfamily/family name redundancy.
# Subset TE classes: DNA, LINE, LTR, RC, SINE
sed 's/?//g' ./data/danRer11_classes_wredundant.txt \
  | awk '{if ($2=="DNA" || $2=="LINE" || $2=="LTR" || $2=="RC" || $2=="SINE") print }' \
  > ./data/danRer11_classes_wredundant.subClass.txt
python ./scripts/adapt_GTF_TEtranscript.py \
  -i ./data/danRer11.nonalt.fa.out.gtf \
  -c ./data/danRer11_classes_wredundant.subClass.txt \
  -o ./**data/danRer11.TEtrans_uID.gtf**

After running the above code, I got danRer11.TEtrans_uID.gtf, which looks like:

chr1    RepeatMasker    exon    1       397     1686    -       .       gene_id "Harbinger-N29_DR"; transcript_id "Harbinger-N29_DR"; family_id "PIF-Harbinger"; class_id "DNA";
chr1    RepeatMasker    exon    403     462     404     +       .       gene_id "L2-5_DRe"; transcript_id "L2-5_DRe"; family_id "L2"; class_id "LINE";
chr1    RepeatMasker    exon    460     532     309     +       .       gene_id "hAT-N25_DR"; transcript_id "hAT-N25_DR"; family_id "hAT-Ac"; class_id "DNA";
chr1    RepeatMasker    exon    469     611     494     +       .       gene_id "Harbinger-8N2_DR"; transcript_id "Harbinger-8N2_DR"; family_id "PIF-Harbinger"; class_id "DNA";

However, it's not the format that clean_OCTFTA_output.py recognizes, and the correct format looks like:

chr1    RepeatMasker    exon    403     462     404     +       .       gene_id "L2-5_DRe"; transcript_id "L2-5_DRe"; family_id "L2"; class_id "LINE"; geneId "ENSDARG00000099104"; transcriptId "ENSDART00000158290"; annotation_fix "Distal"; TE_GE_strand "OS"; annotation_2 "Intergenic.OS"; annotation_3 "Intergenic.OS"
chr1    RepeatMasker    exon    625     1027    2198    +       .       gene_id "L2-5_DRe"; transcript_id "L2-5_DRe_dup1";  family_id "L2"; class_id "LINE"; geneId "ENSDARG00000099104"; transcriptId "ENSDART00000158290"; annotation_fix "Distal"; TE_GE_strand "OS"; annotation_2 "Intergenic.OS"; annotation_3 "Intergenic.OS"
chr1    RepeatMasker    exon    1026    1390    1791    +       .       gene_id "L2-5_DRe"; transcript_id "L2-5_DRe_dup2"; family_id "L2"; class_id "LINE"; geneId "ENSDARG00000099104"; transcriptId "ENSDART00000158290"; annotation_fix "Distal"; TE_GE_strand "OS"; annotation_2 "Intergenic.OS"; annotation_3 "Intergenic.OS"

And I got the following errors when running ./scripts/adapt_GTF_TEtranscript.py in 3.make_TE_classification.sh

  File "./scripts/clean_OCTFTA_output.py", line 318, in <module>
    print_TEgtf_line = make_TEgtf_print_line(TEgtf_lines[i])
  File "./scripts/clean_OCTFTA_output.py", line 180, in make_TEgtf_print_line
    attribute_string = 'gene_id "%s"; transcript_id "%s"; defrag_transcript_id "%s"; family_id "%s"; class_id "%s"; geneId "%s"; transcriptId "%s"; annotation_fix "%s"; TE_GE_strand "%s"; annotation_2 "%s"; annotation_3 "%s"' % (TEgtf_line[8]['gene_id'], TEgtf_line[8]['transcript_id'], TEgtf_line[8]['defrag_transcript_id'], TEgtf_line[8]['family_id'], TEgtf_line[8]['class_id'], TEgtf_line[8]['geneId'], TEgtf_line[8]['transcriptId'], TEgtf_line[8]['annotation_fix'], TEgtf_line[8]['TE_GE_strand'], TEgtf_line[8]['annotation_2'], TEgtf_line[8]['annotation_3'])
KeyError: 'geneId'

So I suspected danRer11.TEtrans_uID.gtf might not be the right input file for the following code of <3.make_TE_classification.sh>.

TE_GTF="./data/**danRer11.TEtrans_uID.gtf**"
time python ./scripts/clean_OCTFTA_output.py \
  --te_gtf ./data/**danRer11.TEtrans_uID.gtf** \
  --input ./data/danRer11.nonalt.fa.out.elem_sorted.csv \
  --te_length ./data/tes.lengths \
  --te_length_cutoff 2 \
  --output ./data/danRer11.TEtrans_uID.dfrag.gtf

Can you help me check if there is anything wrong with the input file of 3.make_TE_classification.sh? Thank you!

Taoo777 commented 1 year ago

@quirze

quirze commented 1 year ago

Hello @Taoo777,

Indeed, it seems like the input format that clean_OCTFTA_output.py expects is not necessarily the one provided in the previous step on the pipeline. Some of the extra attributes expected in the input GTF are added by the script run_ChIPseeker_TE_annot.R. If you would like to avoid the error messages, I would suggest modifying the function make_TEgtf_print_line in your clean_OCTFTA_output.py script to ignore the attributes that cause the error.

I hope this was helpful. Quirze

Taoo777 commented 1 year ago

@quirze OK! I know how to get extra attributes of input GTF now. I tried to modify the function make_TEgtf_print_line and successfully solved the error problems, thank you for your reply!