oushujun / EDTA

Extensive de-novo TE Annotator
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1905-y
GNU General Public License v3.0

How to get softmasked genome as output #166

Closed romseg closed 3 years ago

romseg commented 3 years ago

Dear author,

Is it possible to get a softmasked genome instead of the hardmasked default? Softmasking is sometimes required or recommended as input by other annotators (other than Maker) or mapping programs, so it would be very useful to have this option. If this option is not currently available in Braker, I would appreciate your suggestions on how to convert the hardmasked file to a softmasked one. Thanks!

oushujun commented 3 years ago

Hello, yes! This functionality can be achieved using EDTA/util/make_masked.pl

Please try it out and let me know if you have any questions.

Best, Shujun

romseg commented 3 years ago

The usage for 'make_masked.pl' is:

Usage: perl make_masked.pl -genome unmasked_genome.fa [options]
        -rmout  [file]  Required. The repeatmasker.out file

But I don't have the 'repeatmasker.out' file. Can I use the hardmasked EDTA output file 'genome.fa.new.masked' instead?

Thanks for your help!

oushujun commented 3 years ago

You may find the rm out file in the anno folder.

Shujun

romseg commented 3 years ago

Oh, I see. I believe it is this one: 'genome.fa.mod.EDTA.RM.out'. I will give it a try. Thanks for your help! :)

Rom

romseg commented 3 years ago

Hi Shujun,

It did its job, but in addition to softmasking all the sequences that were hardmasked in the original 'genome.fa.mod.MAKER.masked' (99 Mbp), 'make_masked.pl' with 'genome.fa.mod.EDTA.RM.out' softmasked an extra ~50 Mbp (149 Mbp in total). It softmasked additional short fragments and in many cases extended the previously hardmasked fragments. I can't tell what these extra softmasked sequences are. I am wondering why there is a difference and which masked version would be more useful for gene annotation (with Maker and/or Braker). At first glance the softmasked version generated with RM.out seems more complete (149 Mbp). Thanks!

Best, Rom
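
For readers who want to reproduce this kind of comparison, here is a minimal sketch of how softmasked (lowercase) and hardmasked (N) bases can be counted in a FASTA file. It is not part of EDTA; the function name `count_masking` and the demo sequences are made up for illustration, and N bases that are assembly gaps would be counted as hardmasked too.

```python
from collections import Counter
import io


def count_masking(fasta_handle):
    """Count softmasked (lowercase), hardmasked (N/n) and total bases in a FASTA stream."""
    counts = Counter()
    for line in fasta_handle:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and header lines
        for base in line:
            counts["total"] += 1
            if base in "Nn":
                counts["hard"] += 1  # hardmasked base (or an assembly gap)
            elif base.islower():
                counts["soft"] += 1  # softmasked base
    return counts


# Tiny in-memory example; pass an open file handle for a real genome.
demo = io.StringIO(">chr1\nACGTacgtNNNN\n>chr2\nacGT\n")
result = count_masking(demo)
print(result["soft"], result["hard"], result["total"])
```

The per-character loop is kept simple for clarity; on a genome of several hundred Mbp it will be slow, and a faster counting method may be preferable.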

oushujun commented 3 years ago

Hi Rom,

The MAKER.masked file was lightly (under) masked to avoid masking genic regions. Like you observed, short TEs won't be masked due to their close distance to genes. If you use this file to perform gene predictions, you will likely get some TEs in your results. Please check out the output section of the manual for more info.

Best, Shujun

romseg commented 3 years ago

Hi Shujun,

That makes sense. It is good to avoid masking genic regions, especially for annotation.

One final question on this masking topic: in the stats of my sum file I observed that 256191256 bp [256 Mbp] (51.54% of the total length) is reported as bpMasked (please see below), since these were identified as TEs. This number is higher than the number of hardmasked bp in the MAKER.masked file (99 Mbp) or in the softmasked file I produced with the 'make_masked.pl' script (149 Mbp). Is this difference also meant to avoid masking genic regions? At first glance it seems a big reduction from 256 to 99 Mbp, but maybe I am not interpreting the results reported in the sum file correctly. I would be grateful to have your thoughts. Thanks! Btw, this is a plant genome.

Repeat Classes
==============
Total Sequences: 396
Total Length: 497039057 bp
Class                  Count        bpMasked    %masked
=====                  =====        ========     =======
LTR                    --           --           --   
    Copia              144280       69906519     14.06% 
    Gypsy              39887        29166065     5.87% 
    unknown            88174        32473354     6.53% 
TIR                    --           --           --   
    CACTA              65760        25490265     5.13% 
    Mutator            86889        39879220     8.02% 
    PIF_Harbinger      85261        22057616     4.44% 
    Tc1_Mariner        7590         1603803      0.32% 
    hAT                73115        24595527     4.95% 
nonTIR                 --           --           --   
    helitron           45381        11018887     2.22% 
                      ---------------------------------
    total interspersed 636337       256191256    51.54%

---------------------------------------------------------
Total                  636337       256191256    51.54%

The best, Rom
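
As a quick arithmetic check of the table above, the per-superfamily bpMasked values can be summed and compared with the reported total and percentage. This sketch only uses numbers quoted in this thread; the 99 Mbp and 149 Mbp figures are the rounded values given above, so their percentages are approximate.

```python
# bpMasked per superfamily, copied from the sum file above
bp_masked = {
    "Copia": 69906519,
    "Gypsy": 29166065,
    "LTR_unknown": 32473354,
    "CACTA": 25490265,
    "Mutator": 39879220,
    "PIF_Harbinger": 22057616,
    "Tc1_Mariner": 1603803,
    "hAT": 24595527,
    "helitron": 11018887,
}
total_length = 497039057  # "Total Length" from the sum file

total_masked = sum(bp_masked.values())
pct = 100 * total_masked / total_length
print(total_masked, f"{pct:.2f}%")

# Approximate genome fraction covered by the two masked files discussed above
for label, mbp in [("MAKER.masked", 99), ("softmasked with RM.out", 149)]:
    print(label, f"{100 * mbp * 1e6 / total_length:.1f}%")
```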

oushujun commented 3 years ago

Hi Rom,

The sum file has all sequences that EDTA believes to be TEs. The MAKER.masked file is a subset of the sum file, which was produced by make_masked.pl with parameters `-maxdiv 30 -minscore 1000 -minlen 1000 -hardmask 1 -misschar N`. You may need to change the parameters for make_masked.pl to make a softmasked version close to what's described in the sum file. I need to correct myself: **you'd better use the `EDTA.anno/EDTA.TEanno.out` file to produce the masked genome because this is the most complete**, e.g. `perl ../util/make_masked.pl -genome genome.fa -minlen 80 -hardmask 0 -t 2 -rmout genome.fa.mod.EDTA.anno/genome.fa.mod.EDTA.TEanno.out`

Best, Shujun
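
To illustrate how thresholds such as `-maxdiv`, `-minscore` and `-minlen` shrink the set of masked TEs, here is a rough sketch that filters RepeatMasker `.out` lines. It is not the code of make_masked.pl: it assumes the standard RepeatMasker `.out` column order (score, divergence, deletion, insertion, query name, begin, end, ...), assumes inclusive comparisons, and the function name `filter_rm_hits` and the demo records are invented for the example.

```python
def filter_rm_hits(lines, maxdiv=30.0, minscore=1000, minlen=1000):
    """Yield (seq_id, start, end) for hits that pass all three thresholds."""
    for line in lines:
        fields = line.split()
        # Data lines start with an integer score; header and blank lines do not.
        if len(fields) < 7 or not fields[0].isdigit():
            continue
        score, div = int(fields[0]), float(fields[1])
        seq_id, start, end = fields[4], int(fields[5]), int(fields[6])
        length = end - start + 1  # coordinates are 1-based and inclusive
        # Assumption: thresholds are inclusive; the real script may differ.
        if div <= maxdiv and score >= minscore and length >= minlen:
            yield seq_id, start, end


# Made-up records in RepeatMasker .out layout
demo = """\
   SW   perc perc perc  query     position in query
score   div. del. ins.  sequence  begin end

 5000   10.0  0.0  0.0  chr1   101  2100 (0) + TE_1 LTR/Copia 1 2000 (0) 1
 1500   35.0  0.0  0.0  chr1  3001  4500 (0) + TE_2 LTR/Gypsy 1 1500 (0) 2
  400    5.0  0.0  0.0  chr1  5001  5080 (0) + TE_3 DNA/Helitron 1 80 (0) 3
""".splitlines()

strict = list(filter_rm_hits(demo))  # MAKER.masked-like thresholds
loose = list(filter_rm_hits(demo, maxdiv=100, minscore=0, minlen=80))
print(len(strict), len(loose))
```

With the strict thresholds only the long, low-divergence hit survives, which mirrors why the MAKER.masked file covers far fewer bp than the sum file.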

romseg commented 3 years ago

Hi Shujun,

It worked pretty well! Masking with genome.fa.mod.EDTA.anno/genome.fa.mod.EDTA.TEanno.out and the suggested parameters produced 254947490 softmasked bp, which is very close to the 256191256 bpMasked reported in the sum file of my genome.

It is good to have all these masking alternatives for downstream processing. Thanks for the assistance and for designing EDTA! It is a great program that makes research so much easier.

All my questions were answered and this thread can be closed.

The best, Rom

SC-Duan commented 3 years ago

Hi Shujun, I want to get a softmasked genome using my own repeat library and feed it to BRAKER. I don't have the mod.EDTA.TEanno.out file, so I want to use RepeatMasker and would like to ask for your help.

  1. You said "The MAKER.masked file is a subset of the sum file, which was produced by make_masked.pl with parameters -maxdiv 30 -minscore 1000 -minlen 1000 -hardmask 1 -misschar N." Are these parameters recommended for BRAKER? Or should I "change the parameter for make_masked.pl to make a softmasked version close to what's described in the sum file"?
  2. Do the parameters "-maxdiv 30 -minscore 1000" in make_masked.pl correspond to the parameters "-div 30 -cutoff 1000" in RepeatMasker? And what corresponds to "-minlen 1000"? If yes, should I still set the parameters "-nolow -norna"?
  3. I checked the EDTA.pl file and found these lines: #make low-threshold masked genome for MAKER `perl $make_masked -genome $genome -rmout $genome.out -maxdiv 30 -minscore 1000 -minlen 1000 -hardmask 1 -misschar N -threads $threads -exclude $exclude` Should I use the EDTA.anno/*EDTA.TEanno.out file or genome.fa.mod.EDTA.RM.out to mask the genome? Thank you very much!

oushujun commented 3 years ago

@dzaccook

  1. To get a softmasked genome you need to use -hardmask 0. For other parameters, I am not sure if there is a better parameter space for BRAKER. The purpose of this script is to filter out short TEs since some of them are overlapping with genes, and masking such information may interfere with gene annotation algorithms. Frankly speaking, I am not familiar with the algorithms of gene annotators. So you may need to play around with different settings to find out.

  2. Yes, (-maxdiv 30 -minscore 1000) = (-div 30 -cutoff 1000). There is no equivalent parameter in RepeatMasker for "-minlen 1000" AFAIK. For the purpose of removing non-genic sequences, you probably want to include "-nolow -norna", but this is an assumption and not fully benchmarked.

  3. It doesn't really matter that much. They are highly overlapped, and those that don't and pass through the filtering scheme probably won't have a huge impact on your gene annotation.

Shujun
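
The difference between `-hardmask 1` and `-hardmask 0` can be pictured with a small sketch: hardmasking replaces the TE bases with a placeholder character such as N, while softmasking only lowercases them, so the sequence is preserved. This is a conceptual illustration, not the implementation in make_masked.pl; the function name `mask_sequence` and the example sequence are made up.

```python
def mask_sequence(seq, intervals, hardmask=False, misschar="N"):
    """Return seq with each 1-based, inclusive (start, end) interval masked."""
    chars = list(seq)
    for start, end in intervals:
        # Clamp the end so intervals running past the sequence do not fail
        for i in range(start - 1, min(end, len(chars))):
            chars[i] = misschar if hardmask else chars[i].lower()
    return "".join(chars)


seq = "ACGTACGTACGT"
te_intervals = [(3, 6)]  # hypothetical TE spanning bases 3 to 6

soft = mask_sequence(seq, te_intervals)                 # lowercase the TE
hard = mask_sequence(seq, te_intervals, hardmask=True)  # replace the TE with N
print(soft)  # ACgtacGTACGT
print(hard)  # ACNNNNGTACGT
```

Because a softmasked genome keeps the underlying bases, tools that accept softmasked input can still read through the masked regions if they choose to.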

SC-Duan commented 3 years ago

Hi Shujun, Thank you very much! I will try it. The best, zac

FengjuanjuanCMS commented 2 years ago

I used make_masked.pl and the output files are all empty. Has anyone encountered this and can explain the reason?

Thank you very much

oushujun commented 2 years ago

@FengjuanjuanCMS you may need to check the RepeatMasker output file provided to the -rmout parameter. --Shujun
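
One way to check the RepeatMasker output file is to confirm that its sequence names actually occur in the genome FASTA. Mismatched names are only one possible cause of empty output and are an assumption here, not something confirmed in this thread. The sketch below assumes the standard RepeatMasker `.out` layout (query name in the fifth column); the helper names and the demo data are invented.

```python
def fasta_ids(lines):
    """Return the set of sequence IDs (first word after '>') in a FASTA stream."""
    ids = set()
    for line in lines:
        if line.startswith(">") and line[1:].split():
            ids.add(line[1:].split()[0])
    return ids


def rm_out_ids(lines):
    """Return the set of query sequence names in RepeatMasker .out lines."""
    ids = set()
    for line in lines:
        fields = line.split()
        # Data lines start with an integer score; header lines do not.
        if len(fields) >= 7 and fields[0].isdigit():
            ids.add(fields[4])
    return ids


# Made-up example where the names differ only by case
fasta = [">chr1 some description", "ACGT", ">chr2", "ACGT"]
rm_out = [" 5000 10.0 0.0 0.0 Chr1 101 2100 (0) + TE_1 LTR/Copia 1 2000 (0) 1"]

shared = fasta_ids(fasta) & rm_out_ids(rm_out)
print(shared or "no shared sequence names")
```

If the intersection is empty, the genome file and the `.out` file probably do not belong together, for example because one was generated from a renamed copy of the genome.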

Wanjie-Feng commented 10 months ago

@oushujun Hi Shujun, I wonder whether the softmasked genome can be used directly for subsequent gene structure annotation if I use the following command to softmask my genome.

perl ../util/make_masked.pl -genome genome.fa -minlen 80 -hardmask 0 -t 2 -rmout genome.fa.mod.EDTA.anno/genome.fa.mod.EDTA.TEanno.out

Do I need to consider simple repeat sequences? In addition, are telomere sequences softmasked by the above command?

oushujun commented 10 months ago

Mostly just TEs. For gene annotation purposes you may want to unmask shorter TEs (e.g. <500 bp) to preserve the gene space. Check out the wiki.

Shujun

Wanjie-Feng commented 10 months ago

1.

Hi Rom,

The sum file has all sequences that EDTA believes to be TEs. The MAKER.masked file is a subset of the sum file, which was produced by make_masked.pl with parameters `-maxdiv 30 -minscore 1000 -minlen 1000 -hardmask 1 -misschar N`. You may need to change the parameters for make_masked.pl to make a softmasked version close to what's described in the sum file. I need to correct myself: **you'd better use the `EDTA.anno/EDTA.TEanno.out` file to produce the masked genome because this is the most complete**, e.g. `perl ../util/make_masked.pl -genome genome.fa -minlen 80 -hardmask 0 -t 2 -rmout genome.fa.mod.EDTA.anno/genome.fa.mod.EDTA.TEanno.out`

Best, Shujun

2.

“4. Low-threshold TE masking: $genome.mod.MAKER.masked. This is a genome file with only long TEs (>=1 kb) being masked. You may use this for de novo gene annotations. In practice, this approach will reduce overmasking for genic regions, which can improve gene prediction quality. However, initial gene models should contain TEs and need further filtering. ”

3.

Mostly just TEs. For gene annotation purposes you may want to unmask shorter TEs (e.g. <500 bp) to preserve the gene space. Check out the wiki. Shujun

-------------------------

From the above information, I think the following command is appropriate if I want to get a softmasked genome for subsequent de novo gene structure annotation:

perl ../util/make_masked.pl -genome genome.fa -minlen 500 -hardmask 0 -t 2 -rmout genome.fa.mod.EDTA.anno/genome.fa.mod.EDTA.TEanno.out

chun-he-316 commented 8 months ago

Hi Shujun, I want to get a softmasked genome. I used the command "perl ../util/make_masked.pl -genome genome.fa -minlen 80 -hardmask 0 -t 2 -rmout genome.fa.mod.EDTA.anno/genome.fa.mod.EDTA.TEanno.out", but this error occurred: "Permission denied ../util/make_masked.pl line 54." Please tell me how to solve this issue. Thanks. The best, Chun