Closed romseg closed 3 years ago
Hello, yes! This functionality can be achieved using EDTA/util/make_masked.pl.
Please try it out and let me know if you have any questions.
Best, Shujun
On Thu, Feb 25, 2021 at 3:22 AM romseg notifications@github.com wrote:
Dear author,
Is it possible to get a softmasked genome instead of the hardmasked default? Softmasking is sometimes required or recommended as input by other annotators (other than Maker) or mapping programs, so it would be very useful to have this option. If this option is not currently available in Braker, I would appreciate your suggestions on how to convert the hardmasked file to a softmasked one. Thanks!
The usage for 'make_masked.pl' is:
Usage: perl make_masked.pl -genome unmasked_genome.fa [options]
-rmout [file] Required. The repeatmasker.out file
But I don't have the 'repeatmasker.out' file. Can I use the hardmasked EDTA output file 'genome.fa.new.masked' instead?
Thanks for your help!
You can find the RepeatMasker .out file in the anno folder.
Shujun
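For readers who only have the unmasked genome and a hardmasked copy, the direct conversion asked about above can be sketched as follows. This is a minimal illustration, not what make_masked.pl does: the function name `soften` is hypothetical, and it assumes both files hold the same sequences with identical lengths and that `N` is the masking character.

```python
def soften(original: str, hardmasked: str, mask_char: str = "N") -> str:
    """Lowercase each base of `original` that is `mask_char` in `hardmasked`.

    Assumption: both strings are the same sequence, base for base.
    Bases that are already `mask_char` in the original are left unchanged,
    because they cannot be told apart from masked positions.
    """
    if len(original) != len(hardmasked):
        raise ValueError("sequences differ in length")
    return "".join(
        o.lower() if h == mask_char and o.upper() != mask_char else o
        for o, h in zip(original, hardmasked)
    )


# Toy example: positions 3-6 are masked in the hardmasked copy.
print(soften("ACGTNACGT", "ACNNNNCGT"))  # prints ACgtNaCGT
```

Using the RepeatMasker .out file with make_masked.pl, as suggested above, is still the route recommended in this thread, since it lets you choose which annotations get masked.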
Oh, I see. I believe it is this one: 'genome.fa.mod.EDTA.RM.out'. I will give it a try. Thanks for your help! :)
Rom
Hi Shujun,
It did its job, but in addition to softmasking all sequences that were hardmasked in the original 'genome.fa.mod.MAKER.masked' (99 Mbp), 'make_masked.pl' with 'genome.fa.mod.EDTA.RM.out' softmasked an extra ~50 Mbp (149 Mbp in total). It softmasked additional short fragments and in many cases extended the previously hardmasked fragments. I can't tell what these extra softmasked sequences are. I am wondering why there is a difference and which masked version would be more useful for gene annotation (with Maker and/or Braker). At first glance the softmasked version generated with RM.out would seem more complete (149 Mbp). Thanks!
Best, Rom
Hi Rom,
The MAKER.masked file was lightly (under) masked to avoid masking genic regions. Like you observed, short TEs won't be masked due to their close distance to genes. If you use this file to perform gene predictions, you will likely get some TEs in your results. Please check out the output section of the manual for more info.
Best, Shujun
Hi Shujun,
That makes sense. It is good to avoid masking genic regions, especially for annotation.
One final question on this masking topic: in the stats of my sum file I observed that 256191256 bp [256 Mbp] (51.54% of the total length) is reported as bpMasked (please see below), since these were identified as TEs. This number is higher than the number of hardmasked bp in the MAKER.masked file (99 Mbp) or in the softmasked one I produced with the 'make_masked.pl' script (149 Mbp). Is this difference also meant to avoid masking genic regions? At first glance it seems a big reduction from 256 to 99 Mbp, but maybe I am not interpreting the results reported in the sum file correctly. I would be grateful to have your thoughts. Thanks! Btw, this is a plant genome.
Repeat Classes
==============
Total Sequences: 396
Total Length: 497039057 bp
Class                  Count    bpMasked   %masked
=====                  =====    ========   =======
LTR                       --          --        --
    Copia             144280    69906519    14.06%
    Gypsy              39887    29166065     5.87%
    unknown            88174    32473354     6.53%
TIR                       --          --        --
    CACTA              65760    25490265     5.13%
    Mutator            86889    39879220     8.02%
    PIF_Harbinger      85261    22057616     4.44%
    Tc1_Mariner         7590     1603803     0.32%
    hAT                73115    24595527     4.95%
nonTIR                    --          --        --
    helitron           45381    11018887     2.22%
---------------------------------------------------
total interspersed    636337   256191256    51.54%
---------------------------------------------------
Total                 636337   256191256    51.54%
The best, Rom
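As a quick sanity check on the summary above, the per-class bpMasked values can be added up and compared with the reported total and percentage. The numbers below are copied from the table; the variable names are only illustrative.

```python
# bpMasked per TE class, copied from the sum file shown above.
class_bp = {
    "Copia": 69906519,
    "Gypsy": 29166065,
    "LTR_unknown": 32473354,
    "CACTA": 25490265,
    "Mutator": 39879220,
    "PIF_Harbinger": 22057616,
    "Tc1_Mariner": 1603803,
    "hAT": 24595527,
    "helitron": 11018887,
}
total_length = 497039057  # "Total Length" from the sum file

masked = sum(class_bp.values())
pct = 100 * masked / total_length
print(masked, f"{pct:.2f}%")  # prints 256191256 51.54%
```

The class rows do add up to the "total interspersed" line, so the 256 Mbp figure covers every TE annotation in the sum file, regardless of length.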
Hi Rom,
The sum file has all sequences that EDTA believes to be TEs. The MAKER.masked file, which was produced by make_masked.pl with parameters `-maxdiv 30 -minscore 1000 -minlen 1000 -hardmask 1 -misschar N`, is a subset of the sum file. You may need to change the parameters for make_masked.pl to make a softmasked version close to what's described in the sum file. I need to correct myself: **you'd better use the `EDTA.anno/EDTA.TEanno.out` file to produce the masked genome because this is the most complete**, e.g.:
`perl ../util/make_masked.pl -genome genome.fa -minlen 80 -hardmask 0 -t 2 -rmout genome.fa.mod.EDTA.anno/genome.fa.mod.EDTA.TEanno.out`
Best, Shujun
Hi Shujun,
It worked pretty well! Masking with genome.fa.mod.EDTA.anno/genome.fa.mod.EDTA.TEanno.out and the suggested parameters produced 254947490 softmasked bp, which is very close to the 256191256 bpMasked reported in the sum file of my genome.
It is good to have all these masking alternatives for downstream processing. Thanks for the assistance and for designing EDTA! It is a great program that makes research so much easier.
All my questions were answered and this thread can be closed.
The best, Rom
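A softmasked-bp figure like the one reported above can be obtained by counting lowercase bases in the masked FASTA. The following is a minimal sketch of one way to do that; the function name `count_softmasked` is hypothetical and it assumes a plain FASTA in which masked bases are lowercase.

```python
def count_softmasked(fasta_lines):
    """Return (lowercase_bases, total_bases) over the sequence lines of a FASTA."""
    soft = total = 0
    for line in fasta_lines:
        if line.startswith(">"):  # skip header lines
            continue
        seq = line.strip()
        total += len(seq)
        soft += sum(1 for base in seq if base.islower())
    return soft, total


# Toy input; for a real genome, pass an open file handle instead of a list.
example = [">chr1", "ACGTacgtNN", "ttGG", ">chr2", "acgt"]
soft, total = count_softmasked(example)
print(soft, total)  # prints 10 18
```

Comparing the first number with the bpMasked total in the sum file shows how closely a given set of make_masked.pl parameters reproduces the full TE annotation.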
Hi Shujun, I want to get a softmasked genome using my own repeat library and feed it to BRAKER. I don't have the mod.EDTA.TEanno.out file, so I want to use RepeatMasker and would like to ask for your help.
@dzaccook
To get a softmasked genome you need to use `-hardmask 0`. For other parameters, I am not sure if there is a better parameter space for BRAKER. The purpose of this script is to filter out short TEs, since some of them overlap with genes and masking them may interfere with gene annotation algorithms. Frankly speaking, I am not familiar with the algorithms of gene annotators, so you may need to play around with different settings to find out.
Yes, (-maxdiv 30 -minscore 1000) = (-div 30 -cutoff 1000). There is no equivalent parameter in RepeatMasker for "-minlen 1000" AFAIK. For the purpose of removing non-genic sequences, you probably want to include "-nolow -norna", but this is presumptuous and not fully benchmarked.
It doesn't really matter that much. They overlap heavily, and those that don't and pass through the filtering scheme probably won't have a huge impact on your gene annotation.
Shujun
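To make the three thresholds discussed above concrete, here is a rough sketch of how a single RepeatMasker .out row could be tested against them. It is not taken from make_masked.pl: the function name `passes` is hypothetical, the inclusive comparisons are an assumption, and the parsing assumes the standard whitespace-separated .out layout (score, % divergence, % deletion, % insertion, query name, query start, query end, ...).

```python
def passes(line, maxdiv=30.0, minscore=1000, minlen=1000):
    """Return True if one .out row meets the divergence, score and length cutoffs."""
    fields = line.split()
    # Header and blank lines are skipped: data rows start with an integer score.
    if len(fields) < 7 or not fields[0].isdigit():
        return False
    score = int(fields[0])
    div = float(fields[1])
    start, end = int(fields[5]), int(fields[6])
    length = end - start + 1
    # Assumption: cutoffs are inclusive; the real script may differ.
    return score >= minscore and div <= maxdiv and length >= minlen


# Made-up rows for illustration only.
long_te = "1504 10.2 0.5 0.0 chr1 1001 2500 (5000) + Copia_1 LTR/Copia 1 1500 (0) 1"
short_te = "300 12.0 0.0 0.0 chr1 3000 3080 (4420) C hAT_2 DNA/hAT (10) 90 10 2"

print(passes(long_te))                           # prints True
print(passes(short_te))                          # prints False
print(passes(short_te, minscore=0, minlen=80))   # prints True
```

The last call shows why relaxing the length and score cutoffs masks more sequence: the short hAT fragment is rejected under the strict settings but accepted under the looser ones.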
Hi Shujun, Thank you very much! I will try it. The best, zac
I used make_masked.pl and the output files are all empty. Has anyone encountered this and can explain the reason?
Thank you very much
@FengjuanjuanCMS you may need to check the RepeatMasker output file provided to the `-rmout` parameter. --Shujun
@oushujun Hi Shujun, I wonder whether the softmasked genome can be used directly for subsequent gene structure annotation if I use the following command to softmask my genome:
perl ../util/make_masked.pl -genome genome.fa -minlen 80 -hardmask 0 -t 2 -rmout genome.fa.mod.EDTA.anno/genome.fa.mod.EDTA.TEanno.out
Do I need to consider simple repeat sequences? In addition, are telomere sequences softmasked by the above command?
Mostly just TEs. For gene annotation purposes you may want to unmask shorter TEs (e.g., <500 bp) to preserve the gene space. Check out the wiki.
Shujun
1.
“Hi Rom, The sum file has all sequences that EDTA believes to be TEs. The MAKER.masked file, which was produced by make_masked.pl with parameters `-maxdiv 30 -minscore 1000 -minlen 1000 -hardmask 1 -misschar N`, is a subset of the sum file. You may need to change the parameters for make_masked.pl to make a softmasked version close to what's described in the sum file. I need to correct myself: **you'd better use the `EDTA.anno/EDTA.TEanno.out` file to produce the masked genome because this is the most complete**, e.g.: `perl ../util/make_masked.pl -genome genome.fa -minlen 80 -hardmask 0 -t 2 -rmout genome.fa.mod.EDTA.anno/genome.fa.mod.EDTA.TEanno.out` Best, Shujun”
2.
“4. Low-threshold TE masking: $genome.mod.MAKER.masked. This is a genome file with only long TEs (>=1 kb) being masked. You may use this for de novo gene annotations. In practice, this approach will reduce overmasking for genic regions, which can improve gene prediction quality. However, initial gene models should contain TEs and need further filtering.”
3.
“Mostly just TEs. For gene annotation purposes you may want to unmask shorter TEs (e.g., <500 bp) to preserve the gene space. Check out the wiki. Shujun”
-------------------------
From the above information, I think the following command is appropriate if I want a softmasked genome for subsequent de novo gene structure annotation:
perl ../util/make_masked.pl -genome genome.fa -minlen 500 -hardmask 0 -t 2 -rmout genome.fa.mod.EDTA.anno/genome.fa.mod.EDTA.TEanno.out
Hi Shujun, I want to get a softmasked genome. I used this command: "perl ../util/make_masked.pl -genome genome.fa -minlen 80 -hardmask 0 -t 2 -rmout genome.fa.mod.EDTA.anno/genome.fa.mod.EDTA.TEanno.out", but this error occurred: "Permission denied ../util/make_masked.pl line 54." Please tell me how to solve this issue. Thanks. The best, Chun