Assessing a wheat genome assembly

zhaoruolan commented 4 years ago

Dear Shujun,

We're using LTR_retriever to assess a wheat genome assembly on Magnus. Because the wheat genome is very large and we have a time limit, which is 24 hrs, so we can't finish the analysis within 24 hrs with -threads 24. Then we split the wheat genome into 63 parts using seqkit, but a lot of errors popped out. We only got the LAI for part 1, the rest 62 parts didn't get any LAI results as expected. So I have 2 questions:

Can we use -threads + a larger number to finish the analysis? for example, 100?
Because the first part has an LAI, so I compared the error profile for part 1 with the error profiles of the rest parts, I found: Use of uninitialized value $seq in string eq at /home/rzhao/Tools/LTR_retriever/bin/call_seq_by_list.pl line 128. substr outside of string at /home/rzhao/Tools/LTR_retriever/bin/call_seq_by_list.pl line 127.

These two errors are not found in the first part. How can we deal with these two errors? Can these be the reason of not getting any LAI results?

Thank you very much for you time. Regards, Ruolan

oushujun commented 4 years ago

Hi Ruolan,

Sorry for the delay response. You may run LTR_retriever on each wheat chromosome, which should be finished in one day. Please update to v2.7 if you have not, which is 100% faster than the previous version. The entire wheat genome finished in 8 days.

For LAI you still need to run each subgenome as a whole. Please refer to the -mono parameter for more details. You can combine intact LTRs of each subgenome together.

Best, Shujun

zhaoruolan commented 4 years ago

Hi Shujun,

Thank you for your reply!

Now we're building ltrharvest database for each part, 63 parts in total, then run ltr_retriever using specific database instead of one large ltrharvest database. I haven't got all the results. I plan to calculate LAI manually after getting results of 63 parts, that is, sum of column 4 row 1 from results will be divided by sum of column 5 row 1 from results. Sorry, I'm not following what you're suggesting in the second paragraph, we've already split the whole genome into 63 parts. Do we have to run LAI for each subgenome to get an accurate result?
I haven't updated yet. Sorry, I'm a beginner. Should I just modify the script as you do for the update?

Cheers, Ruolan

oushujun commented 4 years ago

Dear Ruolan,

Yes you can devide and conquer for LTR_retriever. Please also add the LTR_FINDER results, otherwise the sensitivity will be suboptimal. See my other Github repo for more information.

LAI should be run on only one subgenome if there is multiple, otherwise it will be overcorrected. If you don't know which piece of sequence belong to which subgenome, you may use whole-genome alignment against the CS2.0 assembly to figure out. You should not simply average LAIs from different runs to estimate the whole genome.

To update, simpy download the new version and use the same conda env or path file for it. Then you are good to go.

Best, Shujun

On Tue, Oct 22, 2019, 10:42 PM zhaoruolan notifications@github.com wrote:

Hi Shujun,

Thank you for your reply!

1.

Now we're building ltrharvest database for each part, 63 parts in total, then run ltr_retriever using specific database instead of one large ltrharvest database. I haven't got all the results. I plan to calculate LAI manually after getting results of 63 parts, that is, sum of column 4 row 1 from results will be divided by sum of column 5 row 1 from results. Sorry, I'm not following what you're suggesting in the second paragraph, we've already split the whole genome into 63 parts. Do we have to run LAI for each subgenome to get an accurate result? 2.

I haven't updated yet. Sorry, I'm a beginner. Should I just modify the script as you do for the update?

Cheers, Ruolan

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/oushujun/LTR_retriever/issues/55?email_source=notifications&email_token=ABNX4NAGEM73KBYGYF2AJ5LQP7B3TA5CNFSM4I7IDQPKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEB76E3I#issuecomment-545251949, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABNX4NC5Y4WZUJEKBGA7TVTQP7B3TANCNFSM4I7IDQPA .

zhaoruolan commented 4 years ago

Thanks Shujun! :)

I updated ltr_retriever.

My supervisor wrote a script for dividing the assembly to the subgenomic level.

Cheers, Ruolan

zhaoruolan commented 4 years ago

Hi Shujun!

I would like to ask you another question. Because our group assembled a wheat genome, and we need to assess the quality of our genome and the quality of the reference genome.

After splitting the genomes into subgenomes, ltr_retriever runs well for our genomes, thank you! However, for the reference genome (IWGSC v1.0), I got following errors and the analysis stoped: Total candidates: 99967 Total uniq candidates: 99962

Thursday 14 November 21:13:52 AEDT 2019 Module 1: Start to clean up candidates... Sequences with 10 missing bp or 0.8 missing data rate will be discarded. Sequences containing tandem repeats will be discarded.

Error: Error while loading sequenceIllegal division by zero at ./bin/cleanup.pl line 78, chunk 1. Thursday 14 November 21:13:52 AEDT 2019 0 clean candidates remained

cp: cannot stat '/mnt2/s-ws2/rzhao/LTR/IWGSC/Subgenome_A.iwgsc_v1.0.fasta.retriever.scn.adj': No such file or directory Thursday 14 November 21:13:52 AEDT 2019 No LTR-RT was found in your data.

Thursday 14 November 21:13:52 AEDT 2019 All analyses were finished!

As I pasted above, 0 clean candidates remained and the analysis was finished without finding any LTR-RT...

Do you know what happened?

Thank you very much!

Cheers, Ruolan

oushujun commented 4 years ago

HI Ruolan,

Glad to know LTR_retriever is working on your genome. For the error, it's likely the input genome file does not match the input LTR candidate list. You may need to check if they were generated using different sequences or if the input LTR candidate list contains candidates that are not in the provided genome. Refer to #18 for more details.

Best, Shujun

zhaoruolan commented 4 years ago

Hi Shujun,

Thank you for replying me! My input file for the IWGSCv1.0 genome is in the format of:

14098 26283 12186 14098 17304 3207 23067 26283 3217 96.21 0 52970 60964 7995 52970 53122 153 60803 60964 162 90.12 0 77568 86766 9199 77568 78049 482 86285 86766 482 99.38 0 96733 106493 9761 96733 98310 1578 104909 106493 1585 95.90 0 122915 136445 13531 122915 123176 262 136184 136445 262 94.66 0 145617 161474 15858 145617 147413 1797 159685 161474 1790 97.55 0 172421 175085 2665 172421 173186 766 174333 175085 753 94.13 0

like you suggested.

I also confirmed that I used subgenome A and the ltrharvest output of subgenome A for the LTR_retriever inputs.

But the error still popped out...

Cheers, Ruolan

oushujun commented 4 years ago

Hi Ruolan,

Did you generate the LTRharvest output of subgenome A all in one command or split the subgenome into multiple parts (i.e., per chr) and combine afterward?

Thanks, Shujun

zhaoruolan commented 4 years ago

Hi Shujun,

Thank you for getting back to me so quickly! :)

I generated the LTRharvest output of subgenome A all in one command without splitting, as I did for our genomes.

Cheers, Ruolan

oushujun commented 4 years ago

Dear Ruolan,

Thanks for your info. Something is wrong. Can you list the file size of the intermediate files? Please use -v to retain intermediate files.

Best, Shujun

zhaoruolan commented 4 years ago

Hi Shujun,

Here are the files when I ran LTR_retriever on the IWGSC subgenome A with -v -rw-r----- 1 rzhao appbio 15 Nov 18 19:47 Subgenome_A.iwgsc_v1.0.fasta.ltrTE.fa -rw-r----- 1 rzhao appbio 0 Nov 18 19:47 Subgenome_A.iwgsc_v1.0.fasta.ltrTE.fa.cleanup -rw-r----- 1 rzhao appbio 0 Nov 18 19:47 Subgenome_A.iwgsc_v1.0.fasta.ltrTE.stg1 -rw-r----- 1 rzhao appbio 0 Nov 18 19:47 Subgenome_A.iwgsc_v1.0.fasta.nmtf.pass.list -rw-r----- 1 rzhao appbio 0 Nov 18 19:47 Subgenome_A.iwgsc_v1.0.fasta.prelib -rw-r----- 1 rzhao appbio 9108648 Nov 18 19:45 Subgenome_A.iwgsc_v1.0.fasta.retriever.scn -rw-r----- 1 rzhao appbio 5337364 Nov 18 19:45 Subgenome_A.iwgsc_v1.0.fasta.retriever.scn.full -rw-r----- 1 rzhao appbio 11016349 Nov 18 19:45 Subgenome_A.iwgsc_v1.0.fasta.retriever.scn.list

and the errors are the same with the previous one, no LTR_RT was found and the analysis stopped.

Cheers, Ruolan

oushujun commented 4 years ago

Hi Ruolan,

Please provide a couple lines for the .scn, .scn.full, .scn.list, and the genome sequence ID. Thanks!

Shujun

On Mon, Nov 18, 2019, 2:52 AM zhaoruolan notifications@github.com wrote:

Hi Shujun,

Here are the files when I ran LTR_retriever on the IWGSC subgenome A with -v -rw-r----- 1 rzhao appbio 15 Nov 18 19:47 Subgenome_A.iwgsc_v1.0.fasta.ltrTE.fa -rw-r----- 1 rzhao appbio 0 Nov 18 19:47 Subgenome_A.iwgsc_v1.0.fasta.ltrTE.fa.cleanup -rw-r----- 1 rzhao appbio 0 Nov 18 19:47 Subgenome_A.iwgsc_v1.0.fasta.ltrTE.stg1 -rw-r----- 1 rzhao appbio 0 Nov 18 19:47 Subgenome_A.iwgsc_v1.0.fasta.nmtf.pass.list -rw-r----- 1 rzhao appbio 0 Nov 18 19:47 Subgenome_A.iwgsc_v1.0.fasta.prelib -rw-r----- 1 rzhao appbio 9108648 Nov 18 19:45 Subgenome_A.iwgsc_v1.0.fasta.retriever.scn -rw-r----- 1 rzhao appbio 5337364 Nov 18 19:45 Subgenome_A.iwgsc_v1.0.fasta.retriever.scn.full -rw-r----- 1 rzhao appbio 11016349 Nov 18 19:45 Subgenome_A.iwgsc_v1.0.fasta.retriever.scn.list

and the errors are the same with the previous one, no LTR_RT was found during the analysis and stopped.

Cheers, Ruolan

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/oushujun/LTR_retriever/issues/55?email_source=notifications&email_token=ABNX4NCXH6KDYC4Z7JTLROTQUJJUHA5CNFSM4I7IDQPKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEEJVQPQ#issuecomment-554915902, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABNX4NGE6QXZD6B5HRV6X63QUJJUHANCNFSM4I7IDQPA .

zhaoruolan commented 4 years ago

Hi Shujun,

The genome sequence IDs for subgenome A are

chr1A chr2A chr3A chr4A chr5A chr6A chr7A

Subgenome_A.iwgsc_v1.0.fasta.retriever.scn:

14098 26283 12186 14098 17304 3207 23067 26283 3217 96.21 0 52970 60964 7995 52970 53122 153 60803 60964 162 90.12 0 77568 86766 9199 77568 78049 482 86285 86766 482 99.38 0 96733 106493 9761 96733 98310 1578 104909 106493 1585 95.90 0 122915 136445 13531 122915 123176 262 136184 136445 262 94.66 0 145617 161474 15858 145617 147413 1797 159685 161474 1790 97.55 0 172421 175085 2665 172421 173186 766 174333 175085 753 94.13 0 234677 243558 8882 234677 235312 636 242918 243558 641 94.85 0 272914 283807 10894 272914 275472 2559 281259 283807 2549 96.25 0

Subgenome_A.iwgsc_v1.0.fasta.retriever.scn.full: 39844:Sun..Nov 39844:Sun..Nov chr1A:100004427..100009468 chr1A:100004427..100009468 chr1A:100053652..100062171 chr1A:100053652..100062171 chr1A:100066387..100079480 chr1A:100066387..100079480 chr1A:10010544..10019122 chr1A:10010544..10019122 chr1A:100127430..100139720 chr1A:100127430..100139720 chr1A:100145807..100154594 chr1A:100145807..100154594 chr1A:10021901..10030687 chr1A:10021901..10030687 chr1A:100253209..100256815 chr1A:100253209..100256815 chr1A:100325142..100333885 chr1A:100325142..100333885 chr1A:100410041..100415386 chr1A:100410041..100415386

Subgenome_A.iwgsc_v1.0.fasta.retriever.scn.list: chr1A:14098..26283[1] chr1A:14098..17304 chr1A:14098..26283[2] chr1A:23067..26283 chr1A:52970..60964[1] chr1A:52970..53122 chr1A:52970..60964[2] chr1A:60803..60964 chr1A:77568..86766[1] chr1A:77568..78049 chr1A:77568..86766[2] chr1A:86285..86766 chr1A:96733..106493[1] chr1A:96733..98310 chr1A:96733..106493[2] chr1A:104909..106493 chr1A:122915..136445[1] chr1A:122915..123176 chr1A:122915..136445[2] chr1A:136184..136445 chr1A:145617..161474[1] chr1A:145617..147413 chr1A:145617..161474[2] chr1A:159685..161474 chr1A:172421..175085[1] chr1A:172421..173186 chr1A:172421..175085[2] chr1A:174333..175085

Can you find any thing wrong with these files?

Thank you very much!

Cheers, Ruolan

oushujun commented 4 years ago

Hi Ruolan,

Thanks for the example lines. They look normal to me. Can you try: perl ...LTR_retriever/bin/call_seq_by_list.pl Subgenome_A.iwgsc_v1.0.fasta.retriever.scn.full -C Subgenome_A.iwgsc_v1.0.fasta > Subgenome_A.iwgsc_v1.0.fasta.ltrTE.fa

Thanks, Shujun

zhaoruolan commented 4 years ago

Hi Shujun,

Thanks! I tried perl /mnt2/s-ws2/rzhao/LTR/IWGSC/Subgenome_A.iwgsc_v1.0.fasta.retriever.scn.full -C /mnt2/s-ws2/rzhao/LTR/IWGSC/Subgenome_A.iwgsc_v1.0.fasta > /mnt2/s-ws2/rzhao/LTR/IWGSC/Subgenome_A.iwgsc_v1.0.fasta.ltrTE.fa

Then I got several lines like this: Semicolon seems to be missing at /mnt2/s-ws2/rzhao/LTR/IWGSC/Subgenome_A.iwgsc_v1.0.fasta.retriever.scn.full line 99960. Bareword found where operator expected at /mnt2/s-ws2/rzhao/LTR/IWGSC/Subgenome_A.iwgsc_v1.0.fasta.retriever.scn.full line 99961, near "99894103 chr7A" (Missing operator before chr7A?) Semicolon seems to be missing at /mnt2/s-ws2/rzhao/LTR/IWGSC/Subgenome_A.iwgsc_v1.0.fasta.retriever.scn.full line 99961. syntax error at /mnt2/s-ws2/rzhao/LTR/IWGSC/Subgenome_A.iwgsc_v1.0.fasta.retriever.scn.full line 1, near "39844:" Execution of /mnt2/s-ws2/rzhao/LTR/IWGSC/Subgenome_A.iwgsc_v1.0.fasta.retriever.scn.full aborted due to compilation errors.

Then the analysis stopped and no data were written into Subgenome_A.iwgsc_v1.0.fasta.ltrTE.fa.

I compared the IWGSC scn.full with our genome scn.full, the formats are identical to me.

Cheers, Ruolan

oushujun commented 4 years ago

Hi Ruolan,

Never saw this before. Can you share the scn.list file with me? I have iwgsc1.0.

Sbujun

On Mon, Nov 18, 2019, 9:03 PM zhaoruolan notifications@github.com wrote:

Hi Shujun,

Thanks! I tried perl /mnt2/s-ws2/rzhao/LTR/IWGSC/Subgenome_A.iwgsc_v1.0.fasta.retriever.scn.full -C /mnt2/s-ws2/rzhao/LTR/IWGSC/Subgenome_A.iwgsc_v1.0.fasta > /mnt2/s-ws2/rzhao/LTR/IWGSC/Subgenome_A.iwgsc_v1.0.fasta.ltrTE.fa

Then I got several lines like this: Semicolon seems to be missing at /mnt2/s-ws2/rzhao/LTR/IWGSC/Subgenome_A.iwgsc_v1.0.fasta.retriever.scn.full line 99960. Bareword found where operator expected at /mnt2/s-ws2/rzhao/LTR/IWGSC/Subgenome_A.iwgsc_v1.0.fasta.retriever.scn.full line 99961, near "99894103 chr7A" (Missing operator before chr7A?) Semicolon seems to be missing at /mnt2/s-ws2/rzhao/LTR/IWGSC/Subgenome_A.iwgsc_v1.0.fasta.retriever.scn.full line 99961. syntax error at /mnt2/s-ws2/rzhao/LTR/IWGSC/Subgenome_A.iwgsc_v1.0.fasta.retriever.scn.full line 1, near "39844:" Execution of /mnt2/s-ws2/rzhao/LTR/IWGSC/Subgenome_A.iwgsc_v1.0.fasta.retriever.scn.full aborted due to compilation errors.

Then the analysis stopped and no data were written into Subgenome_A.iwgsc_v1.0.fasta.ltrTE.fa.

I compared the IWGSC scn.full with our genome scn.full, the formats are identical to me.

Cheers, Ruolan

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/oushujun/LTR_retriever/issues/55?email_source=notifications&email_token=ABNX4NAIGA67N3SQYLB5AJDQUNJRDA5CNFSM4I7IDQPKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEEMV7OI#issuecomment-555311033, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABNX4NG3M5N4BDAA7HMJKU3QUNJRDANCNFSM4I7IDQPA .

zhaoruolan commented 4 years ago

Hi Shujun,

Thank you so much!

I've uploaded the scn.list. You can find it at:

https://drive.google.com/file/d/1ZAfs1_Tj2k_Mqa9YNAm_Mky964wn4KpJ/view?usp=sharing

Cheers, Ruolan

oushujun commented 4 years ago

Hi Ruolan,

Sorry for the delay response. Your earlier command was wrong: perl /mnt2/s-ws2/rzhao/LTR/IWGSC/Subgenome_A.iwgsc_v1.0.fasta.retriever.scn.full -C /mnt2/s-ws2/rzhao/LTR/IWGSC/Subgenome_A.iwgsc_v1.0.fasta > /mnt2/s-ws2/rzhao/LTR/IWGSC/Subgenome_A.iwgsc_v1.0.fasta.ltrTE.fa

Please specify this script to run it: ...LTR_retriever/bin/call_seq_by_list.pl

I downloaded your data and run it with my IWGSC1.0 genome, it works without error: perl ~/las/git_bin/LTR_retriever/bin/call_seq_by_list.pl Subgenome_A.iwgsc_v1.0.fasta.retriever.scn_test.list -C 161010_Chinese_Spring_v1.0_pseudomolecules.fasta.A.list.fa > Subgenome_A.iwgsc_v1.0.fasta.retriever.scn_test.list.fa.

Was it possible that you specified a wrong genome file in your previous run?

Best, Shujun

zhaoruolan commented 4 years ago

Hi Shujun,

Thanks so much!

This time I ran it using the scn.list rather than scn.full. I do get a file: 211151133 Nov 27 22:23 Subgenome_A.iwgsc_v1.0.fasta.ltrTE.fa

With this file Subgenome_A.iwgsc_v1.0.fasta.ltrTE.fa, can I get the file Subgenome_A.iwgsc_v1.0.fasta.retriever.scn.adj? Since this is the key file for LTR_retriever to proceed.

And I got another problem for our genome, when I started to run ltrretriever for subgenome B of our genome, I didn't get fasta.out file, which is also very weird to me, because everything worked fine for subgenome A. So I didn't get the LAI. I looked at the manual and found that the fasta.out file is generated by repeatmasker. So I'm running repeatmasker now.

Thank you for your always kind help! :)

Cheers, Ruolan

oushujun commented 4 years ago

Dear Ruolan,

How are things going on your wheat subgenomes? From your last post, if you can get the Subgenome_A.iwgsc_v1.0.fasta.ltrTE.fa using call_seq_by_list.pl, that means both the genome sequence and your candidate list files are ok. You may need to rerun the same process from fresh and make sure the candidate list matches the genome version you are using.

For RepeatMasker, sometimes it will not generate output and I had this issue for other wheat genomes too. You may want to run it with extra memory or simply change to other machines. You may want to open an issue on their github if you could not resolve it.

Best, Shujun

Atvar2 commented 4 years ago

Dear Dr.ou, We always meet the problem (Can't locate object method "end" via package "Thread::Queue" at LTR.identifier.pl) when using LTR_retriever to integrate LTR , and it produces nothing but two empty file(CHRR_integrated.shanli.fa.nmtf.pass.list, CHRR_integrated.shanli.fa.defalse). We can't find the reson, would you like to give us some suggestions? many thaks for that, best regards

oushujun commented 4 years ago

You may want to reinstall the program. Looks like perl is not ready. Try conda install -c bioconda ltr_retriever

On Sun, Mar 1, 2020, 7:09 AM chenjunhui notifications@github.com wrote:

Dear Dr.ou, We always meet the problem (Can't locate object method "end" via package "Thread::Queue" at LTR.identifier.pl) when using LTR_retriever to integrate LTR , and it produces nothing but two empty file(CHRR_integrated.shanli.fa.nmtf.pass.list, CHRR_integrated.shanli.fa.defalse). We can't find the reson, would you like to give us some suggestions? many thaks for that, best regards

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/oushujun/LTR_retriever/issues/55?email_source=notifications&email_token=ABNX4NEU6NKFWMTDW7PFKGDRFJNBZA5CNFSM4I7IDQPKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOENM6VOA#issuecomment-593095352, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABNX4NED44DFPY3WCSZPWHLRFJNBZANCNFSM4I7IDQPA .

Atvar2 commented 4 years ago

Thank you for your reply,

It's very useful. We used LTR_finder and LTR_harvest for identifying LTR of specie with identity 0.85, then integrate LTR through LTR_retriever software. However, we only obtained 297 intact ltr in *.pass.list file.Would you give us some suggestions? regards

oushujun commented 4 years ago

Hi @chenjunhui,

You may check if the raw.scn file has similar line numbers to the sum of both inputs. If so, then the program should be correctly executed. Low number of intact could be due to many biological (e.g., genome size, boom and burst) or technical reasons (ie., assembly quality).

Best, Shujun

Atvar2 commented 4 years ago

Hello, Dr.ou, Thank you very much for your suggetions. We got it. Another question, how to obtain statistc of copia, sire, TF and so on basing on the LTR_retrivers? We wanted to study sub-familes of each intact LTR.

regards

oushujun commented 4 years ago

LTR_retriever classify intact elements to the superfamily level, you may need to use other programs for further classifications, such as TEsorter.

Shujun

On Wed, Mar 11, 2020 at 2:22 AM chenjunhui notifications@github.com wrote:

Hello, Dr.ou, Thank you very much for your suggetions. We got it. Another question, how to obtain statistc of copia, sire, TF and so on basing on the LTR_retrivers? We wanted to study sub-familes of each intact LTR.

regards

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/oushujun/LTR_retriever/issues/55?email_source=notifications&email_token=ABNX4NHSE66IEMEJ2LN4KIDRG5C4ZA5CNFSM4I7IDQPKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEOOSUQQ#issuecomment-597502530, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABNX4NEGHRAWZAOBA3Y75RTRG5C4ZANCNFSM4I7IDQPA .

Atvar2 commented 4 years ago

Dear Dr.ou,

Thank you for your help. However, which fasta file should we provide to TEsorter if we use it for dividing the result of LTR_retrivers?

regards

zhaoruolan commented 4 years ago

Hi Shujun,

Sorry I hadn't checked this thread for a long time. Thank you for asking! :) My supervisor found out that my ltrharvest files contained excessive lines. After deleting those irrelevant lines, I successfully got LAIs of chr1As for the three wheat genomes including iwgsc v1.0. But when I went to the subgenome level, I can't get LAIs any more. The problem maybe still related to RepeatMasker. LTR_retriever can't generate file.fasta.out files. There is no error message popping out either, every step looks fine.

Here is the log file when I ran for iwgsc subgenome A: Thursday 12 March 04:20:42 AEDT 2020 Dependency checking: All passed! Thursday 12 March 04:21:11 AEDT 2020 LTR_retriever is starting from the Init step. Thursday 12 March 04:22:33 AEDT 2020 Start to convert inputs... Total candidates: 199920 Total uniq candidates: 99960

Thursday 12 March 04:24:22 AEDT 2020 Module 1: Start to clean up candidates... Sequences with 10 missing bp or 0.8 missing data rate will be discarded. Sequences containing tandem repeats will be discarded.

Thursday 12 March 05:02:03 AEDT 2020 82545 clean candidates remained

Thursday 12 March 05:02:03 AEDT 2020 Modules 2-5: Start to analyze the structure of candidates... The terminal motif, TSD, boundary, orientation, age, and superfamily will be identified in this step.

Thursday 12 March 11:35:51 AEDT 2020 Intact LTR-RT found: 27495

Thursday 12 March 20:10:13 AEDT 2020 Module 6: Start to analyze truncated LTR-RTs... Truncated LTR-RTs without the intact version will be retained in the LTR-RT library. Use -notrunc if you don't want to keep them.

Thursday 12 March 20:10:14 AEDT 2020 13520 truncated LTR-RTs found Friday 13 March 05:30:08 AEDT 2020 143 truncated LTR sequences have added to the library

Friday 13 March 05:30:08 AEDT 2020 Module 5: Start to remove DNA TE and LINE transposases, and remove plant protein sequences... Total library sequences: 10983 Warning: [blastx] Number of threads was reduced to 16 to match the number of available CPUs Warning: [blastx] Number of threads was reduced to 16 to match the number of available CPUs Warning: [blastx] Number of threads was reduced to 16 to match the number of available CPUs Friday 13 March 08:50:20 AEDT 2020 Retained clean sequence: 10982

Friday 13 March 08:50:20 AEDT 2020 Sequence clustering for /mnt2/s-ws2/rzhao/LTR/IWGSC/Subgenome_A.iwgsc_v1.0.fasta.ltrTE ... Friday 13 March 08:50:20 AEDT 2020 Unique lib sequence: 10969

Friday 13 March 10:33:20 AEDT 2020 Module 6: Start to remove nested insertions in internal regions... Friday 13 March 11:29:49 AEDT 2020 Raw internal region size (bit): 59812593 Clean internal region size (bit): 16378709

Friday 13 March 11:30:04 AEDT 2020 Sequence number of the redundant LTR-RT library: 82628 The redundant LTR-RT library size (bit): 260814003

Friday 13 March 11:30:04 AEDT 2020 Module 8: Start to make non-redundant library...

Friday 13 March 11:40:01 AEDT 2020 Final LTR-RT library entries: 7503 Final LTR-RT library size (bit): 20209791

Friday 13 March 11:40:02 AEDT 2020 Total intact LTR-RTs found: 27495 Total intact non-TGCA LTR-RTs found: 913

Friday 13 March 11:42:25 AEDT 2020 All analyses were finished!

############################## ####### Result files ######### ############################## Table output for intact LTR-RTs (detailed info) /mnt2/s-ws2/rzhao/LTR/IWGSC/Subgenome_A.iwgsc_v1.0.fasta.pass.list (All LTR-RTs) /mnt2/s-ws2/rzhao/LTR/IWGSC/Subgenome_A.iwgsc_v1.0.fasta.nmtf.pass.list (Non-TGCA LTR-RTs) /mnt2/s-ws2/rzhao/LTR/IWGSC/Subgenome_A.iwgsc_v1.0.fasta.pass.list.gff3 (GFF3 format for intact LTR-RTs)

LTR-RT library /mnt2/s-ws2/rzhao/LTR/IWGSC/Subgenome_A.iwgsc_v1.0.fasta.LTRlib.redundant.fa (All LTR-RTs with redundancy) /mnt2/s-ws2/rzhao/LTR/IWGSC/Subgenome_A.iwgsc_v1.0.fasta.LTRlib.fa (All non-redundant LTR-RTs) /mnt2/s-ws2/rzhao/LTR/IWGSC/Subgenome_A.iwgsc_v1.0.fasta.nmtf.LTRlib.fa (Non-TGCA LTR-RTs)

I can't get the LAIs for other subgenomes either. So I went to RepeatMasker again. I read your script and saw this line: RepeatMasker -e ncbi -qq -pa $threads -no_is -norna -nolow -div 40 -lib $index.ltrTE.pass.LTR -cutoff 225 $index2.ltrTE.LTR 2

I thought I could run this line separately to get the file.fasta.out file. $index.ltrTE.pass.LTR can be generated using perl $script_path/bin/call_seq_by_list.pl $index.retriever.scn.adj.list -C $genome itself > $index.ltrTE.pass.LTR. I have $index.retriever.scn.adj.list, so it is fine. But I don't know where $index2.ltrTE.LTR comes from..

If I only use: RepeatMasker -qq -pa $threads -no_is -norna -nolow -div 40 -species wheat Subgenome_A.iwgsc_v1.0.fasta, can I get a similar result?

Thank you very much!

Cheers, Ruolan

oushujun commented 4 years ago

Hi Ruolan,

Thanks for getting back to me, you were running LTR_retriever without annotation. You can run RepeatMasker using the library and the subgenome to get the .out file, then use the LAI program to get the LAI estimation. You can also run LTR_retriever on the entire genome, then use the LAI program to estimate for different subgenomes.

Best, Shujun

On Sat, Mar 14, 2020 at 5:41 PM zhaoruolan notifications@github.com wrote:

Hi Shujun,

Sorry I hadn't checked this thread for a long time. Thank you for asking! :) My supervisor found out that my ltrharvest files contained excessive lines. After deleting those irrelevant lines, I successfully got LAIs of chr1As for the three wheat genomes including iwgsc v1.0. But when I went to the subgenome level, I can't get LAIs any more. The problem maybe still related to RepeatMasker. LTR_retriever can't generate file.fasta.out files. There is no error message popping out either, every step looks fine.

Here is the log file when I ran for iwgsc subgenome A: Thursday 12 March 04:20:42 AEDT 2020 Dependency checking: All passed! Thursday 12 March 04:21:11 AEDT 2020 LTR_retriever is starting from the Init step. Thursday 12 March 04:22:33 AEDT 2020 Start to convert inputs... Total candidates: 199920 Total uniq candidates: 99960

Thursday 12 March 04:24:22 AEDT 2020 Module 1: Start to clean up candidates... Sequences with 10 missing bp or 0.8 missing data rate will be discarded. Sequences containing tandem repeats will be discarded.

Thursday 12 March 05:02:03 AEDT 2020 82545 clean candidates remained

Thursday 12 March 05:02:03 AEDT 2020 Modules 2-5: Start to analyze the structure of candidates... The terminal motif, TSD, boundary, orientation, age, and superfamily will be identified in this step.

Thursday 12 March 11:35:51 AEDT 2020 Intact LTR-RT found: 27495

Thursday 12 March 20:10:13 AEDT 2020 Module 6: Start to analyze truncated LTR-RTs... Truncated LTR-RTs without the intact version will be retained in the LTR-RT library. Use -notrunc if you don't want to keep them.

Thursday 12 March 20:10:14 AEDT 2020 13520 truncated LTR-RTs found Friday 13 March 05:30:08 AEDT 2020 143 truncated LTR sequences have added to the library

Friday 13 March 05:30:08 AEDT 2020 Module 5: Start to remove DNA TE and LINE transposases, and remove plant protein sequences... Total library sequences: 10983 Warning: [blastx] Number of threads was reduced to 16 to match the number of available CPUs Warning: [blastx] Number of threads was reduced to 16 to match the number of available CPUs Warning: [blastx] Number of threads was reduced to 16 to match the number of available CPUs Friday 13 March 08:50:20 AEDT 2020 Retained clean sequence: 10982

Friday 13 March 08:50:20 AEDT 2020 Sequence clustering for /mnt2/s-ws2/rzhao/LTR/IWGSC/Subgenome_A.iwgsc_v1.0.fasta.ltrTE ... Friday 13 March 08:50:20 AEDT 2020 Unique lib sequence: 10969

Friday 13 March 10:33:20 AEDT 2020 Module 6: Start to remove nested insertions in internal regions... Friday 13 March 11:29:49 AEDT 2020 Raw internal region size (bit): 59812593 Clean internal region size (bit): 16378709

Friday 13 March 11:30:04 AEDT 2020 Sequence number of the redundant LTR-RT library: 82628 The redundant LTR-RT library size (bit): 260814003

Friday 13 March 11:30:04 AEDT 2020 Module 8: Start to make non-redundant library...

Friday 13 March 11:40:01 AEDT 2020 Final LTR-RT library entries: 7503 Final LTR-RT library size (bit): 20209791

Friday 13 March 11:40:02 AEDT 2020 Total intact LTR-RTs found: 27495 Total intact non-TGCA LTR-RTs found: 913

Friday 13 March 11:42:25 AEDT 2020 All analyses were finished!

############################## ####### Result files ######### ############################## Table output for intact LTR-RTs (detailed info) /mnt2/s-ws2/rzhao/LTR/IWGSC/Subgenome_A.iwgsc_v1.0.fasta.pass.list (All LTR-RTs) /mnt2/s-ws2/rzhao/LTR/IWGSC/Subgenome_A.iwgsc_v1.0.fasta.nmtf.pass.list (Non-TGCA LTR-RTs) /mnt2/s-ws2/rzhao/LTR/IWGSC/Subgenome_A.iwgsc_v1.0.fasta.pass.list.gff3 (GFF3 format for intact LTR-RTs)

LTR-RT library /mnt2/s-ws2/rzhao/LTR/IWGSC/Subgenome_A.iwgsc_v1.0.fasta.LTRlib.redundant.fa (All LTR-RTs with redundancy) /mnt2/s-ws2/rzhao/LTR/IWGSC/Subgenome_A.iwgsc_v1.0.fasta.LTRlib.fa (All non-redundant LTR-RTs) /mnt2/s-ws2/rzhao/LTR/IWGSC/Subgenome_A.iwgsc_v1.0.fasta.nmtf.LTRlib.fa (Non-TGCA LTR-RTs)

I can't get the LAIs for other subgenomes either. So I went to RepeatMasker again. I read your script and saw this line: RepeatMasker -e ncbi -qq -pa $threads -no_is -norna -nolow -div 40 -lib $index.ltrTE.pass.LTR -cutoff 225 $index2.ltrTE.LTR 2

I thought I could run this line separately to get the file.fasta.out file. $index.ltrTE.pass.LTR can be generated using perl $script_path/bin/ call_seq_by_list.pl $index.retriever.scn.adj.list -C $genome itself > $index.ltrTE.pass.LTR. I have $index.retriever.scn.adj.list, so it is fine. But I don't know where $index2.ltrTE.LTR comes from..

If I only use: RepeatMasker -qq -pa $threads -no_is -norna -nolow -div 40 -species wheat Subgenome_A.iwgsc_v1.0.fasta, can I get a similar result?

Thank you very much!

Cheers, Ruolan

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/oushujun/LTR_retriever/issues/55#issuecomment-599163738, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABNX4NAUGEVXJEW5HNAV3FDRHRE5VANCNFSM4I7IDQPA .

zhaoruolan commented 4 years ago

Hi Shujun,

Thanks a lot. I've got the LAI of iwgsc v1.0 subgenome D without using -noanno. Now I finally knew I couldn't get LAI if I use -noanno... For iwgsc v1.0 subgenome A, I used the library $index.ltrTE.pass.LTR, it was much faster than using -species wheat, I got the SubgenomeA.iwgsc.v.1.0.fasta.out file this morning.

Cheers, Ruolan

oushujun commented 4 years ago

Hi Ruolan,

You should use the final library for LTR annotation, $index.ltrTE.pass.LTR only contains the LTR regions but no internal regions, which will give you underestimation of total LTR sequences.

-noanno will not give you LAI because it won't generate LTR annotation which is required. Please use RepeatMasker for LTR annotation in this case then use the LAI program to calculate LAI in the next step.

Shujun

oushujun commented 4 years ago

Let me clear things up a bit more.

In a polyploid genome like wheat, you will run LTR_retriever with annotation (default) on the entire genome (not only on subgenomes), then you will get a list of intact LTRs (*intact.pass.list) and the whole-genome LTR annotation ($genome.out). This is the first step.

Then you need to run the LAI program by yourself, and don't use the automatically generated LAI result by LTR_retriever. The LAI program is located under the same folder of LTR_retriever. You need to run this program in a subgenome mode with the -mono parameter and providing the file containing a list of sequence names of the target subgenome. You don't need to separate the intact and all LTR information based on subgenomes, the program was designed to do so for you. All you need is to provide a list of sequence names for that subgenome. This is the second step.

Now you just run LAI on one subgenome, the last step is to run LAI on the rest of the subgenomes you like to.

Hope this helps.

Shujun

Atvar2 commented 4 years ago

Dear Dr.ou,

So sorry to trouble you. We predicted LTR of C.sativa ,C.capsularis and G.raimondii using LTR_finder( -D 15000 -d 1000 -L 7000 -l 100 -p 20 -C -M 0.85) and LTRharvest (-similar 75 -seed 20 -minlenltr 100 -maxlenltr 7000 -maxdistltr 20000 -mintsd 5 -maxtsd 6 ). Then we retrivered LTR from the the result above by LTR_retriver. However, we obtained a small number of LTR, range from 1362 to 2145. We can't find any resons for that anyway, would you like to give us some suggestions?

MoradMMokhtar commented 1 year ago

@oushujun Dear Dr. Oushujun I am sorry to have bothered you so often. The Triticum aestivum IWGSC CS RefSeq v2.1 [https://www.ncbi.nlm.nih.gov/genome/?term=Triticum%20aestivum] is divided into subgenomes A, B, and D and we can use the steps you gave to calculate LAI, but my question is what about the unassigned sequences of chromosomes and subgenomes (340.79 Mb). I think if we exclude these sequences from the analysis, the LAI value will not be sensitive. Am I understanding this correctly or am I missing information here?

oushujun commented 1 year ago

Hello @MoradMMokhtar ,

Yes you need to separate the ploidy and calculate separately. For those unassigned sequences, assign them as you can, otherwise you are calculating LAI for the scaffolded part of each haploid.

Thanks, Shujun

MoradMMokhtar commented 1 year ago

@oushujun Thank you for your response back. I have no idea how to assign them. Can I treat these unassigned sequences as a separate subgenome or add them to any subgenome? Because if it is possible to assign them to a subgenome, why did the author of the genome not assign them? Another question: if you did the LAI analysis for the wheat genome, how do you treat these sequences?

oushujun commented 1 year ago

Then I will just leave them out and specify the LAI is calculated for scaffolded sequences.

On Wed, Dec 7, 2022 at 4:58 PM MoradMMokhtar @.***> wrote:

@oushujun https://github.com/oushujun Thank you for your response back. I have no idea how to assign them. Can I treat these unassigned sequences as a separate subgenome or add them to any subgenome? Because if it is possible to assign them to a subgenome, why did the author of the genome not assign them? Another question: if you did the LAI analysis for the wheat genome, how do you treat these sequences?

— Reply to this email directly, view it on GitHub https://github.com/oushujun/LTR_retriever/issues/55#issuecomment-1341646861, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABNX4NHBBVJUS5CCDKMAJA3WMEJBVANCNFSM4I7IDQPA . You are receiving this because you were mentioned.Message ID: @.***>

MoradMMokhtar commented 1 year ago

Ok, Thanks again for your time

oushujun / LTR_retriever

Assessing a wheat genome assembly #55