parklab / xTea

Comprehensive TE insertion identification with WGS/WES data from multiple sequencing technics
Other
102 stars 23 forks source link

xTea on plant species #20

Open agolicz opened 3 years ago

agolicz commented 3 years ago

Hi, We wanted to try xTea on a plant species. Is that possible or does it require a human reference?

Agnieszka

simoncchu commented 3 years ago

Hi, the current repeat library is only prepared for human. You need to prepare a library for the plant species that you want to work on. Generally, you need to know what type (family) of repeats you are working on, the consensus sequence of them, the reference genome of the species, and also the repeat annotation (from RepeatMasker or other tools). With this, we can generate the library for the plant species and run on the alignments. I didn't try this before. I think it should work, but need extra effort for library prepare.

agolicz commented 3 years ago

Thanks! We will give it a try see how it goes.

mnshgl0110 commented 3 years ago

It would be really helpful if there is a guide on how to generate repeat library for non-humans.

@agolicz did you manage to run xTea on plants? Could you please share how did it go?

agolicz commented 3 years ago

Unfortunately I have not had the time yet. Hopefully at the beginning of next year. EDTA https://github.com/oushujun/EDTA, seems like a good candidate for plant repeat library creation.

simoncchu commented 2 years ago

I put a readme to prepare the repeat library here: https://github.com/parklab/xTea/tree/master/xtea/rep_lib_prep. Would you like to have a try? Note, xTea only works for TE insertions of known type, and need the repatmasker annotation of the TEs you are interested, and also a consensus sequence.

agolicz commented 2 years ago

Would it also be possible to add fasta header format (>xxx) for the TE library and the repeatmasker command to be used with that library?

simoncchu commented 2 years ago

What do you mean by fasta header format? For which file? The repeatmasker output format is explained here: https://www.repeatmasker.org/webrepeatmaskerhelp.html. To run RepeatMasker on your assembled genome, you need to construct a consensus sequence library, and then feed in RepeatMasker (will call blast) to do the annotation. You can run tools like repeatscout to construct the consensus from the assembled genome. You can check the parameters by running RepeatMasker --help. If the species you are working on has some published reference genome, with big chance other people had annotated the genome, and you can directly use those.

agolicz commented 2 years ago

For the file: TE-type.consensus.fa and TE_copies_with_flank.fa Per repeat masker documentation: https://www.animalgenome.org/bioinfo/resources/manuals/RepeatMasker.html The recommended format for IDs in a custom library is:

repeatname#class/subclass or simply repeatname#class

I just wanted to confirm that xTea expects the same.

For running pipelines like that on non-model species it is very helpful to have toy datasets, so we can ensure everything is formatted as expected. Some formatting conventions are not the same for human/animal/plant genomics.

simoncchu commented 2 years ago

You only need to extract the TE type you wanted to work on (each type separately). For example, if you have a repeatmasker output for the whole genome named species_rmsk.out, and you are interested in say LINE1, then you could run grep "LINE1" species_rmsk.out > species_rmsk_L1.out.

When generate the TE_copies_with_flank.fa, you only need to feed in the full length copies, so you need to select based on length from the generated pecies_rmsk_L1.out.

I am thinking of having a script to automatically generate this, but it's not easy to have a fix mode. Different species/TEs are of different length and different ids (some are customized set).

agolicz commented 2 years ago

Ok, thanks that makes sense. I will try to give it a try in January.

adriaaanarcillo commented 2 years ago

Hello, @agolicz! I would like to ask if you have successfully used x-Tea on plants already? If so, how did it go? Thanks.

DR-genomics commented 2 years ago

Hello, I tried to create a plant repeat library using your instructions given in: https://github.com/parklab/xTea/tree/master/xtea/rep_lib_prep. However, I received an error: xtea: error: no such option: -P

And I don't see -P option in the xtea help page as well. What does -P stand for here?

Thanks!

simoncchu commented 2 years ago

Could you try again by replacing xtea with python full-path/x_TEA_main.py ?

DR-genomics commented 2 years ago

I tried the following and got this error: python xtea/x_TEA_main.py -P -K -p ./ -r ../refgenome.fasta -a RMasker.out -o /home/xTea/TE_copies_with_flank.fa -e 100

Traceback (most recent call last): File "xtea/x_TEA_main.py", line 345, in x_annotation.load_rmsk_annotation() File "/gpfs20/scratch/dramacha/xTea/xtea/x_annotation.py", line 241, in load_rmsk_annotation start_pos = int(fields[5]) ValueError: invalid literal for int() with base 10: 'position'

zhuxf-lab commented 2 years ago

I am trying to prepare xTea repeat library using the chm13 genome. I got the TE-type_rmsk.out, but currently have trouble getting the full-length-TE-type_rmsk.out. Do you have any suggestions for how to get the full-length TE? By structure, or by length?

simoncchu commented 2 years ago

@zhuxf-lab it's based on length for the active Human retrotransposons. For example, L1, I set >5900bp as full length.

zhuxf-lab commented 2 years ago

@zhuxf-lab it's based on length for the active Human retrotransposons. For example, L1, I set >5900bp as full length.

Hi, I tried using >5900bp as the cutoff for the full length L1. I run hg38 first to see whether I can reproduce the result in the provided hg38 rep_lib_annotation data. It turned out that the result I got was much larger than the annotation file provided. For example, the _hg38_FL_L1flanks.fa file I got is 53MB (using -e 100), while the size of _hg38_FL_L1_flanks3k.fa in the provided rep_lib_annotation file is 2MB. I attached my code here, any idea where is incorrect? The hg38 reference genome and repeatmasker output file are all from UCSC.

######### grep "LINE1" hg38.fa.out > hg38.fa_L1.out cat hg38.fa_L1.out | while read line do
eval $(echo ${line}|awk '{printf("var_9=%s;var_12=%s;var_13=%s;var_14=%s;",$9,$12,$13,$14)}') if [ $var_9 == "C" ];then
i_length=$(($var_13 - $var_14)) else
i_length=$(($var_13 - $var_12)) fi if [ $i_length -gt 5900 ];then
echo "$line" fi done >hg38.fa_L1_full_length.out ### this is to select out the LINE1 >5900bp

python x_TEA_main.py -P -K -p ./ -r hg38.fa -a hg38.fa_L1_full_length.out -o hg38.fa_L1_full_length_with_flank_e100.fa -e 100 #########

And is it reasonable to set cutoff for full-length Alu, SVA, HERV as 250bp, 1900bp, 8900bp?

It would be super helpful if you could kindly add chm13 into the rep_lib_annotation data. Thank you!

simoncchu commented 2 years ago

@zhuxf-lab I moved your question to a new issue https://github.com/parklab/xTea/issues/50, I'll work on it asap.

simoncchu commented 2 years ago

@zhuxf-lab while I am working on this issue, the size difference (53M vs 2M) is because I only select L1HS (reported active L1) rather than all the L1 subfamilies.

Alu, SVA, HERV as 250bp, 1900bp, 8900bp?

For SVA, I set 700bp.

zhuxf-lab commented 2 years ago

@zhuxf-lab while I am working on this issue, the size difference (53M vs 2M) is because I only select L1HS (reported active L1) rather than all the L1 subfamilies.

Alu, SVA, HERV as 250bp, 1900bp, 8900bp?

For SVA, I set 700bp.

Ok, Thanks!

bismarck1008 commented 2 years ago

I'm very curious to know if the process for the custom repeat library is available now. https://github.com/parklab/xTea/tree/master/xtea/rep_lib_prep

simoncchu commented 2 years ago

@bismarck1008 it should work

bismarck1008 commented 2 years ago

xtea -P -K -p ./ -r path-of-reference-genome.fa -a path-to-rep-lib-folder/full-length-TE-type_rmsk.out -o path-output-folder/TE_copies_with_flank.fa -e 100 I tried the command line above. P, K, and e parameters are not identified

simoncchu commented 2 years ago

try python your-xtea-folder/x_TEA_main.py instead of xtea? @bismarck1008

adriludwig commented 1 year ago

Considering that other species have other elements than just L1, Alu, SVA and HERV, would xTea identify them? In this case which would be the option for y parameter? Thanks

simoncchu commented 1 year ago

@adriludwig , use "-y 32". Here is a readme: https://github.com/parklab/xTea/tree/master/xtea/rep_lib_prep (at the bottom). It's not convenient as you can only run one repeat type at a time. I'll try to write up a new version/wrapper for this.

adriludwig commented 1 year ago

Thanks very much, @simoncchu. We are currently using mobster, but we would also like to test other tools. So I'll keep an eye on xTea updates.

Anees-caas commented 4 months ago

Respected experts! I'm new to this topic and I'm curious to know if currently the repeated library were available for plants or is there any update on XTea for transposable element analysis in plants.