oushujun / LTR_retriever

LTR_retriever is a highly accurate and sensitive program for identification of LTR retrotransposons; The LTR Assembly Index (LAI) is also included in this package.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5813529/
GNU General Public License v3.0

LAI does not improve despite sequencing and assembly improvements #87

Open oushujun opened 3 years ago

oushujun commented 3 years ago

Dear Shujun, first of all, I apologize for my poor English. I have sequenced dozens of samples that all belong to the same genus (e.g. wild species, cultivars, landraces), and I plan to construct a pan-genome. I am not an expert in sequencing experiments either, but I know that all libraries were built with the same kit and extraction method.

Here is a detailed description of the only sample with LAI < 20, which I will call SampleA. At the beginning of my project, dozens of samples were each sequenced to about 120-160X with PacBio. The subread lengths looked good, with subread N50 values of 10-20 kb, so I assembled them with Canu and obtained dozens of genomes. The LAI of all these genomes exceeds 20, except for SampleA at 16. The contig N50 of SampleA was also unusually low (just 200 kb), so we contacted technical experts to perform another round of sequencing. This time I got about 500X of new PacBio data for SampleA, again with normal subread length and subread N50. I reassembled SampleA and ran LTR_retriever: the contig N50 improved to ~5 Mb, while the LAI is still ~16. This surprised me and I cannot explain it. Since I have enough data, I also tried using only subreads longer than 8 kb, 10 kb, or even 20 kb with different assembly software; all the LAI values are between 16 and 17, stable as Mount Tai. According to common sense and your article published in Nature Communications, high-depth sequencing should improve assembly quality. The contig N50 improved significantly (200 kb to 5 Mb), but the LAI did not improve.

While writing this, I thought about it again. If the problem were in library construction, the contig N50 would not have improved. I used to think there was a problem with the DNA extraction process, such as human error in some experimental operations. However, I got a reasonable genome size, a high contig N50, and a high BUSCO score (99%), so my earlier guess about sequencing library construction may be unreasonable. What still bothers me is that a high-depth assembly gives a long contig N50 but a low LAI value. Different assembly methods and parameters affect contig N50, but they do not seem to affect LAI. Even when I use only subreads longer than 10 kb for assembly, the LAI is still ~16. I have high-depth sequencing and long reads, yet the LAI has not improved.

Thank you for taking the time to discuss so much with me.

Best regards, Weihan

_Originally posted by @Weihankk in https://github.com/oushujun/LTR_retriever/issues/86#issuecomment-768498135_

oushujun commented 3 years ago

Hi Weihan,

Thank you for your detailed descriptions. I opened a new issue because this is a different but interesting topic.

We demonstrated in the LAI paper that genome size, TE content, contig N50, BUSCO, etc., are not significantly correlated with LAI. Of course, back then we only had a limited number of 'good' genomes to test on, and their quality was not evenly distributed. I have noticed similar cases, but those were sequenced with Nanopore. I thought it might have something to do with the sequencing technique, but your case suggests it may be more prevalent. I am still collecting similar cases because so far they are sporadic and thus I have limited power to detect the cause. By the time I have enough data, the raw LAI correction may need re-calibration.

You mentioned assembling the genome with >20kb subreads, what is the coverage? Can you post one of the LAI screen outputs for this genome?

Thanks, Shujun

Weihankk commented 3 years ago

Hi Shujun,

Thank you for your patience and help. I used Canu with minReadLength=20000, so that only reads longer than 20 kb were selected as input for assembly. According to the Canu report, about 39.26x of data was finally used for assembly.

The final assembly result is pretty good (reasonable genome size, low number of contigs, high contig N50). The LAI is 16.38.

Below is my LTR_retriever log. I used LTR_FINDER and ltrharvest to find LTRs, with all parameters at the defaults you recommend:

Parameters: -genome THG.rename.fasta -inharvest THG.rename.fasta.rawLTR.scn -threads 40

Thu Jan 28 14:43:45 CST 2021    Dependency checking: All passed!
Thu Jan 28 14:44:05 CST 2021    LTR_retriever is starting from the Init step.
Thu Jan 28 14:44:05 CST 2021    Start to convert inputs...
                                Total candidates: 4763
                                Total uniq candidates: 4175

Thu Jan 28 14:44:07 CST 2021    Module 1: Start to clean up candidates...
                                Sequences with 10 missing bp or 0.8 missing data rate will be discarded.
                                Sequences containing tandem repeats will be discarded.

Thu Jan 28 14:45:32 CST 2021    3536 clean candidates remained

Thu Jan 28 14:45:32 CST 2021    Modules 2-5: Start to analyze the structure of candidates...
                                The terminal motif, TSD, boundary, orientation, age, and superfamily will be identified in this step.

Thu Jan 28 14:47:41 CST 2021    Intact LTR-RT found: 1186

Thu Jan 28 14:48:27 CST 2021    Module 6: Start to analyze truncated LTR-RTs...
                                Truncated LTR-RTs without the intact version will be retained in the LTR-RT library.
                                Use -notrunc if you don't want to keep them.

Thu Jan 28 14:48:27 CST 2021    289 truncated LTR-RTs found
Thu Jan 28 14:49:13 CST 2021    75 truncated LTR sequences have added to the library

Thu Jan 28 14:49:13 CST 2021    Module 5: Start to remove DNA TE and LINE transposases, and remove plant protein sequences...
                                Total library sequences: 1113
Thu Jan 28 14:54:02 CST 2021    Retained clean sequence: 1107

Thu Jan 28 14:54:02 CST 2021    Sequence clustering for THG.rename.fasta.ltrTE ...
Thu Jan 28 14:54:02 CST 2021    Unique lib sequence: 1104

Thu Jan 28 14:54:31 CST 2021    Module 6: Start to remove nested insertions in internal regions...
Thu Jan 28 14:57:42 CST 2021    Raw internal region size (bit): 3872896
                                Clean internal region size (bit): 3123561

Thu Jan 28 14:57:42 CST 2021    Sequence number of the redundant LTR-RT library: 3520
                                The redundant LTR-RT library size (bit): 7872192

Thu Jan 28 14:57:42 CST 2021    Module 8: Start to make non-redundant library...

Thu Jan 28 14:58:02 CST 2021    Final LTR-RT library entries: 1036
                                Final LTR-RT library size (bit): 3392750

Thu Jan 28 14:58:02 CST 2021    Total intact LTR-RTs found: 1103
                                Total intact non-TGCA LTR-RTs found: 82

Thu Jan 28 14:58:03 CST 2021    Start to annotate whole-genome LTR-RTs...
                                Use -noanno if you don't want whole-genome LTR-RT annotation.

I observed that the number of intact LTRs is about 100-300 lower than in the other samples. In the *.LAI files, the total LTR length also seems low.

Chr     From    To      Intact  Total   raw_LAI LAI
whole_genome    1       230113814       0.0310  0.2085  14.88   16.38

I didn't save the LTR_retriever logs of the other samples, so I only show their *.LAI files.

Chr     From    To      Intact  Total   raw_LAI LAI
whole_genome    1       236367658       0.0463  0.2124  21.81   23.29
whole_genome    1       233869550       0.0462  0.2143  21.56   25.10
whole_genome    1       229280339       0.0472  0.2073  22.78   26.07
whole_genome    1       235617391       0.0510  0.2343  21.78   24.08

Please let me know if you need other information or LTR_retriever running logs of other samples, I can rerun LTR_retriever at any time.

Additionally, do you mean that the raw LAI correction may need re-calibration once you have enough high-quality genomes? Like the 2.8138 factor, or the equation itself? Maybe I can help you collect some high-quality reference genomes that currently have a relatively high LAI.

Thanks, Weihan

oushujun commented 3 years ago

Hi Weihan,

Thank you for the feedback. Please run LAI independently and capture the screen output, or check the directory for the .iden (or .age, I forget which one) file. I suspect that this genome has less LTR activity and that the LAI was not corrected well. Another useful piece of information would be a histogram of intact LTR age for a couple of your genomes (with a bin width of 0.2 MYA). The special genome has less total LTR and also less intact LTR in the assembly; I saw similar cases in Solanaceae species.

For the correction, yes, it is currently the 2.8138 factor, but it may need more corrections if we want to take that route. It's hard to be fair to all species, so careful evaluations are required. I suspect the length of LTR is also involved, so if you get a chance, you may also check whether your genomes have different LTR lengths.

Best, Shujun

Weihankk commented 3 years ago

Hi Shujun,

Thanks again for your patient help.


Below is my LAI screen output:

######################################
### LTR Assembly Index (LAI) beta3.2 ###
######################################

Developer: Shujun Ou

Please cite:

Ou S., Chen J. and Jiang N. (2018). Assessing genome assembly quality using the LTR Assembly Index (LAI). Nucleic Acids Res. gky730: https://doi.org/10.1093/nar/gky730

Parameters: -genome THG.rename.fasta -intact THG.rename.fasta.pass.list -all THG.rename.fasta.out -t 40

Thu Jan 28 15:20:33 CST 2021    Dependency checking: Passed!
Thu Jan 28 15:20:33 CST 2021    Calculation of LAI will be based on the whole genome.
                                Please use the -mono parameter if your genome is a recent ployploid, for high identity between homeologues will overcorrect raw LAI scores.
Thu Jan 28 15:20:33 CST 2021    Estimate the identity of LTR sequences in the genome: standard mode
Thu Jan 28 15:21:21 CST 2021    The identity of LTR sequences: 93.4680571379919%
Thu Jan 28 15:21:21 CST 2021    Calculate LAI:

                                                Done!

Thu Jan 28 15:21:25 CST 2021    Result file: THG.rename.fasta.out.LAI

                                You may use either raw_LAI or LAI for intraspecific comparison
                                but please use ONLY LAI for interspecific comparison

Below is my THG.rename.fasta.out.q.LAI.LTR.ava.age file:

Input:THG.rename.fasta.out      Seq_num:25048   Mean_identity:93.4680571379919
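For readers following the numbers: the LAI reported above can be reproduced from raw_LAI and the mean LTR identity with the correction described in the LAI paper, LAI = raw LAI + 2.8138 × (94 − identity). The Python sketch below only illustrates that arithmetic. The function name and defaults are made up for this example, and it is not the LAI program's own code.

```python
def corrected_lai(raw_lai, mean_identity, factor=2.8138, baseline_identity=94.0):
    """Apply the identity-based correction to a raw LAI value.

    raw_lai        -- raw_LAI column from the *.LAI file
    mean_identity  -- mean LTR identity in percent (e.g. from the LAI screen output)
    factor, baseline_identity -- constants taken from the LAI paper
    """
    return raw_lai + factor * (baseline_identity - mean_identity)


# Values for the low-LAI genome reported in this thread.
lai = corrected_lai(raw_lai=14.88, mean_identity=93.4680571379919)
print(f"corrected LAI = {lai:.2f}")
```

Because the identity of this genome is below 94%, the correction is positive, which is why LAI (16.38) is higher than raw_LAI (14.88).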

Below is the histogram of intact LTR age. I plot it from *.rename.fasta.pass.list file. The bin width is 0.2 MYA.

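[Figure: histogram of intact LTR age for the low-LAI genome (LowLAI)]

[Figure: histogram of intact LTR age for a high-LAI genome (HighLAI)]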

Here I also share the simple plot script, hoping to help someone in need.

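library(data.table)
# Column 12 (V12) of pass.list is the insertion time in years; convert to MYA
pass.list <- fread("THG.rename.fasta.pass.list", header = FALSE)
plot.dt <- pass.list$V12 / 1e6

# 0.2-MYA bins starting at 0; round the last break up so the oldest element is not dropped
upper <- max(0.2, ceiling(max(plot.dt) / 0.2) * 0.2)
breaks <- seq(0, upper, by = 0.2)
x <- table(cut(plot.dt, breaks = breaks, include.lowest = TRUE))
names(x) <- breaks[-1]  # label each bar with the upper edge of its bin

barplot(height = x, main = "LAI = 16.38")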
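For anyone who prefers to get the bin counts without plotting, the same 0.2 MYA binning can be sketched in Python. This is only an illustration: the function names are hypothetical, and it assumes, as the R script above does, that column 12 of pass.list holds the insertion time in years and that header lines start with `#`.

```python
from collections import Counter

BIN_WIDTH_YEARS = 200_000  # 0.2 MYA, expressed in years to keep the arithmetic integer


def read_insertion_times(path, column=12):
    """Read insertion times (years) from a whitespace-delimited pass.list file.

    Assumes the 1-based `column` holds the insertion time and that
    comment/header lines start with '#'.
    """
    ages = []
    with open(path) as handle:
        for line in handle:
            if not line.strip() or line.startswith("#"):
                continue
            ages.append(int(float(line.split()[column - 1])))
    return ages


def bin_ages(ages_years, bin_width_years=BIN_WIDTH_YEARS):
    """Count elements per age bin; keys are the lower bin edge in MYA."""
    counts = Counter(int(age // bin_width_years) for age in ages_years)
    return {i * bin_width_years / 1e6: counts[i] for i in sorted(counts)}


# Toy ages in years, not data from this thread.
print(bin_ages([0, 150_000, 200_000, 390_000, 1_000_000]))
```

Running `bin_ages(read_insertion_times("THG.rename.fasta.pass.list"))` on two genomes gives tables that can be compared bin by bin, for example to check how many elements fall in the youngest bins.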

Best, Weihan

oushujun commented 3 years ago

Hi Weihan,

Thanks for sharing the data and scripts. It looks like your genomes have fewer young LTRs. Did you polish the genomes? Both Arrow and Pilon can fix sequencing errors and help to identify intact LTRs.

Best, Shujun

Weihankk commented 3 years ago

Hi Shujun,

I tried to polish the genome these days by Arrow and Pilon. However, Pilon is slower and still running, so I used NextPolish instead.

For only Arrow polish, LAI has slightly improved, from 16.38 to 16.40.

Chr     From    To      Intact  Total   raw_LAI LAI
whole_genome    1       230131163       0.0309  0.2059  15.01   16.40

For Arrow + NextPolish, LAI has been further improved from 16.40 to 16.48

Chr     From    To      Intact  Total   raw_LAI LAI
whole_genome    1       230080435       0.0312  0.2072  15.04   16.48

The result is not as good as expected. Do you have any other suggestions to improve LAI? It seems that my LAI cannot reach 20. As you said, the genome has fewer young LTRs, so I get a low LAI value. Does this mean that my genome is only Reference quality instead of Gold quality? This genome with LAI≈16.40 is from an ancient ancestral species. If this genome actually contains only a small number of young LTRs, then it is almost impossible to get a high LAI, right? (Sorry, I am not familiar with this field; please point out anything I have misunderstood.)

Best regards, Weihan

Weihankk commented 3 years ago

Hi Shujun, I just got the Pilon result and saw a significant improvement (LAI from 16.40 to 17.61)

This is the LAI for Arrow (1 round) + Pilon (1 round)

Chr     From    To      Intact  Total   raw_LAI LAI
whole_genome    1       229963784       0.0308  0.1954  15.78   17.61

The 2nd round of Pilon is still running. It seems that polish can improve LAI. Do you have any suggestions for improving LAI?

Best regards, Weihan

oushujun commented 3 years ago

Hi Weihan,

Thanks for sharing the data. At this point, you have probably tried everything you can to improve the genome. It is likely that LAI can not evaluate this ancestral genome properly. You may state that in your manuscript.

For research purposes, can you help to characterize the mean length of LTR regions from intact LTR-RTs in these genomes? In the pass.list file, you can find coordinates of these LTR regions. Thanks!

Best, Shujun

Weihankk commented 3 years ago

Hi Shujun,

Thanks for your recent help.

For the no polish genome LTR_retriever result (LAI = 16.38), the mean length of LTR regions is 6519.417.

For the Arrow + Pilon polish (LAI = 17.61), the mean length of LTR regions is 6499.817.

I calculated the length of LTR regions by the first column from pass.list file. E.g. For "Seq1:1526070..1531424", the length is 1531424-1526070.

Best regards, Weihan

oushujun commented 3 years ago

Hi Weihan,

Thanks for getting these data quickly. LTR-RTs have the structure LTR-internal-LTR, so the LTR region means either one of the LTR sequences. There is a column IN:xxx-yyy, which gives the coordinates of the internal region; from it you can derive the coordinates of the left or right LTR region. Can you also include this value for a couple of the high-LAI genomes? Thanks!
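A minimal Python sketch of that calculation is given below, for illustration only. It assumes the first pass.list column looks like `Seq1:1526070..1531424`, that the internal column looks like `IN:start..end` (or `IN:start-end`), and that all coordinates are 1-based and inclusive. The function name and the internal coordinates in the example are hypothetical.

```python
import re

# Matches "start..end" or "start-end"
COORD = re.compile(r"(\d+)(?:\.\.|-)(\d+)")


def ltr_lengths(ltr_loc, internal):
    """Return (left_LTR_length, right_LTR_length) for one pass.list record.

    ltr_loc  -- whole element, e.g. "Seq1:1526070..1531424"
    internal -- internal region, e.g. "IN:1526570..1530924"
    """
    start, end = map(int, COORD.search(ltr_loc.rsplit(":", 1)[-1]).groups())
    in_start, in_end = map(int, COORD.search(internal.split(":", 1)[-1]).groups())
    left = in_start - start   # bases start .. in_start - 1
    right = end - in_end      # bases in_end + 1 .. end
    return left, right


def mean_ltr_length(records):
    """Mean length over both LTRs of every (ltr_loc, internal) pair."""
    lengths = [n for loc, inner in records for n in ltr_lengths(loc, inner)]
    return sum(lengths) / len(lengths)


# The element coordinates come from this thread; the IN: coordinates are invented.
print(ltr_lengths("Seq1:1526070..1531424", "IN:1526570..1530924"))
```

Subtracting the two ends of the first column, as done earlier in the thread, gives the length of the whole element (both LTRs plus the internal region), not the length of one LTR.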

Best, Shujun

Weihankk commented 3 years ago

Hi Shujun,

Thanks for your correction. The following are my recalculated results.

For LAI = 16.38, the mean length of LTR regions is 5478.012 (no polish).
For LAI = 17.61, the mean length of LTR regions is 5464.779 (Arrow + Pilon polish).

For another genome with LAI = 25.10, the mean length of LTR regions is 5400.613 (no polish).
For another genome with LAI = 21.92, the mean length of LTR regions is 5425.018 (no polish).
For another genome with LAI = 20.12, the mean length of LTR regions is 5569.597 (no polish; according to our records, this sample is likely to be an ancient wild species).

If you need other results, please feel free to contact me.

Best regards, Weihan

oushujun commented 3 years ago

Hi Weihan,

Thank you for the information. There seems to be a negative correlation between LAI and LTR region length, but that doesn't make sense given the equation. Supposedly, if the LTR is longer, the total intact LTR length is longer and the raw LAI is higher. So I think length is not playing a big role here.

I don't have other ideas at the moment. I will keep this issue open and if anybody has similar issues, they may report here. Thanks again for helping to diagnose the issue.

Best, Shujun

chaimol commented 2 years ago

Hi Shujun Ou, I get two different LAI values.

Value 1: whole_genome 1 272430828 0.0163 0.3975 4.09 6.80
Value 2: whole_genome 1 272430828 0.0161 0.1220 13.23 17.19

The parameters for value 1 were:

gt ltrharvest \
   -index ${species} \
   -similar 90 -vic 10 -seed 20 -seqids yes \
   -minlenltr 100 -maxlenltr 7000 -mintsd 4 -maxtsd 6 \
   -motif TGCA -motifmis 1 > ${species}.harvest.scn

The parameters for value 2 were:

gt ltrharvest \
   -index ${species} \
   -similar 85 -vic 10 -seed 20 -seqids yes \
   -minlenltr 100 -maxlenltr 7000 -mintsd 4 -maxtsd 6 \
   -motif TGCA -motifmis 1 > ${species}.harvest.scn

Other parameters are the same as the official recommendation. But the difference between the LAI values obtained by the two methods is very large, so I want to know which result is the more reliable LAI value. From the distribution of LAI of each chromosome, I prefer to believe in low LAI.

Weihankk commented 2 years ago

I just saw this notification during lunch, and it looks very interesting... Your two LAI results show that the intact LTR fraction is nearly the same (0.0163 and 0.0161), while the total LTR fraction differs (0.3975 and 0.1220). That is, the numerator is almost unchanged but the denominator has increased a lot, so the ratio (LAI) has dropped a lot.
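The ratio can be checked directly from the two rows, since raw LAI is the intact LTR-RT fraction divided by the total LTR fraction, times 100. The short Python sketch below is only an illustration with a made-up function name. Results differ slightly from the reported raw_LAI values because the fractions in the *.LAI file are rounded to four decimals.

```python
def raw_lai(intact_fraction, total_fraction):
    """Raw LAI: intact LTR-RT length over total LTR length, as a percentage."""
    return 100.0 * intact_fraction / total_fraction


# (Intact, Total) columns of the two runs reported above.
runs = {
    "ltrharvest -similar 90": (0.0163, 0.3975),
    "ltrharvest -similar 85": (0.0161, 0.1220),
}

for label, (intact, total) in runs.items():
    print(f"{label}: raw LAI is about {raw_lai(intact, total):.2f}")
```

The intact fraction changes by about 1% between the runs, while the total fraction changes more than threefold, so almost all of the difference in LAI comes from the denominator.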

If you combine LTRharvest and LTR_FINDER but change only the LTRharvest -similar parameter, you will get more LTR records after merging the results of the two methods. On the other hand, if you keep the parameters of LTRharvest and LTR_FINDER at the same level, most LTR records from the two methods may be consistent.

Sorry if it is unclear what I mean. As far as I know, LTR_FINDER_parallel sets the default similarity parameter to 0.85. Maybe you could try setting the similarity of LTR_FINDER_parallel to match LTRharvest as a test while waiting for Shujun's reply.

Weihan

oushujun commented 2 years ago

Hi @chaimol,

Weihan is correct, the total LTR content annotated via approach 1 is 39.75% and the second is 12.20%. While theoretically it's impossible to know the exact LTR content in your species, you may estimate it with annotations. Your two annotations are very different, thus making the LAI values very different. You need to make a decision on which LTR content estimation is more reliable. You may also need to combine LTRharvest and LTR_FINDER inputs to make a calculation more comparable with other genomes published with the LAI method.

Best, Shujun

chaimol commented 2 years ago

Sorry, I did not clarify the specific parameter settings. In fact, both value 1 and value 2 use LTR_FINDER_parallel with its default similarity setting of 0.85, and both also use ltrharvest; only the ltrharvest -similar setting differs: 90 for value 1 and 85 for value 2. Finally, I ran LTR_retriever -genome genome.fa -inharvest harvest.scn -infinder finder.scn -threads 36 -u 7e-9

oushujun commented 2 years ago

@chaimol It's your responsibility to determine which estimation of total LTR content is closer to the truth. If you think, for example, the total LTR content should be 50%, then you need to use -totLTR 50 to inform the LAI program. Otherwise, it will try to estimate it based on the whole-genome LTR annotation on your assembly, which is not necessarily correct.

Shujun

frabanal commented 2 years ago

Hi @oushujun ,

I'm following up on a topic I started in EDTA, but has to do with estimating LAI within species, sometimes between different assemblies of the same genotype: CLR vs HiFi, or different assemblers. Therefore, total assembly –and scaffolded– sizes can be quite different, which may or may not be relevant to my question.

Below are the LAI values for the various assemblies of the same genotype, but notice that they are performed in scaffolded assemblies with different sizes:

CLR_Canu_scaffolds:whole_genome       1   121215396   0.0133  0.0680  19.62   20.89
HiFi_IPA_scaffolds:whole_genome       1   123435913   0.0129  0.0669  19.34   20.34
HiFi_HiCanu_scaffolds:whole_genome    1   135570366   0.0131  0.0652  20.13   20.40
HiFi_FALCON_scaffolds:whole_genome    1   136095093   0.0137  0.0678  20.19   19.98
HiFi_Hifiasm_scaffolds:whole_genome   1   136162473   0.0139  0.0688  20.15   20.42

Following your advice, I added -totLTR 6.88 (taken from the $ASS.mod.out.LAI output file of the most complete assembly) and -iden 93.90 (taken from the $ASS.mod.out.LAI.LTR.ava.age output file of the most complete assembly). In any case, these numbers are all way too similar across assemblies. Unfortunately, it does not seem that the constant -genome_size 142000000 I'm providing has been picked up for the analysis. Actually, the -genome_size parameter is not even listed among LAI options. Am I missing something? I'm running LTR_retriever_v2.9.0.

Here the new LAI values:

CLR_Canu_scaffolds:whole_genome       1   121215396   0.0133  0.0688  19.39   19.67
HiFi_IPA_scaffolds:whole_genome       1   123435913   0.0129  0.0688  18.81   19.09
HiFi_HiCanu_scaffolds:whole_genome    1   135570366   0.0131  0.0688  19.07   19.35
HiFi_FALCON_scaffolds:whole_genome    1   136095093   0.0137  0.0688  19.89   20.17
HiFi_Hifiasm_scaffolds:whole_genome   1   136162473   0.0139  0.0688  20.15   20.43

Since a constant genome size parameter did not work, would it be valid or too unfair for the smaller assemblies to scale the Intact and Total percentages to the "real" genome size?

In a way, I'm not too surprised that LAI values are not that different even between CLR and the best HiFi assembly. I have good evidence that the megabases missing in the CLR assembly are mostly centromeres and rDNA clusters, and that not many contig breaks are due to TEs. This is why I doubt whether it makes sense to estimate LAI in the smaller assemblies with the parameters from the largest assembly. At the moment, I kind of favour the raw_LAI from the first list (without fixed -totLTR and -iden). I would truly appreciate your input on this topic.

Kindly,

Fernando

oushujun commented 2 years ago

Hello Fernando,

I had not pushed the -genome_size parameter to the public repository, but now I have. Please update your LTR_retriever and recalculate.

As I stated in the LAI paper and also in the LAI output, raw LAI is suitable for within-species comparisons, while LAI works for both within- and between-species comparisons. However, raw LAI does not have genome size controlled. Having -totLTR, -iden, and -genome_size controlled makes your results comparable between different assemblies of the same species and also makes them comparable to other genomes.

Judging on the results shared above, these assemblies are very close to each other in terms of TE assembly quality. You may want to look at other quality metrics such as N50, BUSCO, or assembly errors to further select your best assembly strategy/result.

Best, Shujun

frabanal commented 2 years ago

Hi @oushujun,

Thanks for pushing the -genome_size parameter. I can confirm it works well with the latest LTR_retriever.

Just for the purpose of completeness in this thread, I post here the LAI values of the same assemblies having controlled for -totLTR, -iden, and -genome_size.

CLR_Canu_scaffolds:whole_genome       1   142000000   0.0114  0.0688  16.55   16.83
HiFi_IPA_scaffolds:whole_genome       1   142000000   0.0112  0.0688  16.35   16.63
HiFi_HiCanu_scaffolds:whole_genome    1   142000000   0.0125  0.0688  18.21   18.49
HiFi_FALCON_scaffolds:whole_genome    1   142000000   0.0131  0.0688  19.07   19.35
HiFi_Hifiasm_scaffolds:whole_genome   1   142000000   0.0133  0.0688  19.32   19.60

In this case, it becomes apparent that the fixed genome size heavily penalises smaller assemblies. I'm not convinced these are the best parameter choices in this particular case, due to the fact that I know that what is missing in the smaller assemblies are mainly the core centromeres. One would have to make the assumption that these loci carry the same proportion of LTRs as the rest of the genome to make this extrapolation valid.
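The size of this penalty can be approximated with a back-of-the-envelope calculation. The Python sketch below assumes that fixing the genome size amounts to dividing the assembled intact LTR-RT length by the supplied genome size instead of the assembly size. That is an assumption made for illustration and is not taken from the LAI source code, and the function names are hypothetical. Because the inputs are the rounded values from the tables above, the results only agree with the reported numbers to within about 0.1.

```python
def rescaled_intact(intact_fraction, assembly_size, genome_size):
    """Intact LTR-RT fraction re-expressed relative to a fixed genome size."""
    return intact_fraction * assembly_size / genome_size


def lai_fixed(intact_fraction, assembly_size, genome_size, tot_ltr_percent, identity):
    """Approximate (raw LAI, LAI) with genome size, total LTR, and identity fixed."""
    intact = rescaled_intact(intact_fraction, assembly_size, genome_size)
    raw = 100.0 * intact / (tot_ltr_percent / 100.0)
    return raw, raw + 2.8138 * (94.0 - identity)


# (Intact fraction, assembly size) from the earlier tables in this thread.
assemblies = {
    "CLR_Canu": (0.0133, 121215396),
    "HiFi_Hifiasm": (0.0139, 136162473),
}

for name, (intact, size) in assemblies.items():
    raw, lai = lai_fixed(intact, size, 142_000_000, tot_ltr_percent=6.88, identity=93.90)
    print(f"{name}: raw LAI is about {raw:.2f}, LAI is about {lai:.2f}")
```

Under this assumption, the drop for each assembly is proportional to how much smaller it is than the fixed genome size, which is why the CLR assembly loses about three LAI units while the Hifiasm assembly loses less than one.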

As you've suggested, I'll complement these observations with other quality metrics.

Thanks! Fernando

oushujun commented 2 years ago

Hi Fernando,

Thank you for sharing your results. If you find the fixed genome size penalises smaller assemblies too much, it could be that the genome size is set too high. If the assemblies are smaller because they lack some components of the genome, then the lower LAI value correctly reflects this. Centromeric regions usually harbor a higher percentage of LTR sequences than the whole-genome level. Missing centromeres may contribute to lower assembly quality of LTR sequences genome-wide. If you want to compare the quality of the assembled part of the genome, you may want to extract the assembled part of all assemblies based on synteny, then compute LAI on these sequences.

Best, Shujun