rmhubley / RepeatMasker

RepeatMasker is a program that screens DNA sequences for interspersed repeats and low complexity DNA sequences.

famdb.py and the latest version of Dfam #238

Closed cyycyj closed 6 months ago

cyycyj commented 7 months ago

Dear developer,

What do you want to know? When will RepeatMasker or famdb.py be updated for the latest version of Dfam?

Helpful context: I found that Dfam has just released an updated Dfam.h5 (https://www.dfam.org/releases/Dfam_3.8/families/FamDB/). With this release I can download only the Viridiplantae partition, so building the database will take me less time.

rmhubley commented 7 months ago

Imminently is the hopeful answer. I expect either today or tomorrow.

cyycyj commented 7 months ago

Wow! That is epic! I can't wait, and I really appreciate your work on this great software.

cyycyj commented 7 months ago

Dear Robert,

I have downloaded the latest version of RepeatMasker and placed dfam38_full.5.h5 (gunzipped from dfam38_full.5.h5.gz) in /home/my_data/biosoft/RepeatMasker/Libraries/famdb/. When I run perl ./configure, it prints the following:

RepeatMasker Configuration Program

Checking for libraries...

   *** No libraries present ***

Choose:
   1. Download minimal Dfam df_version (partition 0) using wget or curl
      (please see https://dfam.org/releases/current/families/FamDB/README.txt for full list of Dfam partitions.)
   2. Exit and download libraries manually (Dfam 3.8 or newer).

It failed to find the database. Could you tell me what happened?

cyycyj commented 7 months ago

My fault. I need to download the root partition (partition 0) to meet the minimum requirement.

README.txt
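For anyone hitting the same "No libraries present" message, a quick check of the famdb directory can confirm whether the root partition is there. The following is only a sketch: the helper names are hypothetical, and the file naming pattern (a partition number before the `.h5` suffix) is an assumption based on the file names that appear in this thread, such as dfam38_full.0.h5 and dfam38_full.5.h5.

```python
import re
from pathlib import Path

# Assumed naming scheme, inferred from file names seen in this thread
# (e.g. dfam38_full.0.h5, dfam38_full.5.h5): "<prefix>.<partition>.h5".
PARTITION_RE = re.compile(r"\.(\d+)\.h5$")


def present_partitions(famdb_dir):
    """Return the sorted partition numbers found in famdb_dir."""
    numbers = set()
    for path in Path(famdb_dir).iterdir():
        match = PARTITION_RE.search(path.name)
        if match:
            numbers.add(int(match.group(1)))
    return sorted(numbers)


def has_root_partition(famdb_dir):
    """True if partition 0 (the root partition) is present."""
    return 0 in present_partitions(famdb_dir)
```

Calling `has_root_partition("Libraries/famdb")` from the RepeatMasker directory would return False in the situation described above, where only the partition 5 file had been downloaded.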

cyycyj commented 7 months ago

Dear Robert,

After installing RepeatMasker 4.1.6, I noticed that the RepeatMaskerLib.h5 file appears to be missing. Could you please explain what might have happened? I would like to use famdb.py and buildRMLibFromEMBL.pl to build a library, with commands like the following:

python famdb.py -i Libraries/RepeatMaskerLib.h5 families -f embl -a -d Cucurbitaceae > Cucurbitaceae.embl   

buildRMLibFromEMBL.pl Cucurbitaceae.embl > Cucurbitaceae.fa

Here is the listing of ./Libraries:

-rwxr-xr-x 1   25K Dec  6 01:02 Artefacts.embl
drwxr-xr-x 4  4.0K Dec  6 16:25 CONS-Dfam_withRBRM_3.8
drwxr-xr-x 2  4.0K Dec  6 16:03 famdb
-rw-r--r-- 1   214 Dec  6 01:02 README.meta
-rw-r--r-- 1  2.9K Oct 27  2018 README.RMRBSeqs
-rw-r--r-- 1   25M Dec  6 01:02 RepeatAnnotationData.pm
-rw-r--r-- 1   97M Dec  6 16:10 RepeatMasker.lib
-rw-r--r-- 1   20K Dec  6 16:10 RepeatMasker.lib.ndb
-rw-r--r-- 1  3.7M Dec  6 16:10 RepeatMasker.lib.nhr
-rw-r--r-- 1  456K Dec  6 16:10 RepeatMasker.lib.nin
-rw-r--r-- 1   583 Dec  6 16:10 RepeatMasker.lib.njs
-rw-r--r-- 1  456K Dec  6 16:10 RepeatMasker.lib.not
-rw-r--r-- 1   25M Dec  6 16:10 RepeatMasker.lib.nsq
-rw-r--r-- 1   16K Dec  6 16:10 RepeatMasker.lib.ntf
-rw-r--r-- 1  152K Dec  6 16:10 RepeatMasker.lib.nto
-rw-r--r-- 1   18M Dec  6 01:02 RepeatPeps.lib
-rw-r--r-- 1   20K Dec  6 16:10 RepeatPeps.lib.pdb
-rw-r--r-- 1  2.8M Dec  6 16:10 RepeatPeps.lib.phr
-rw-r--r-- 1  141K Dec  6 16:10 RepeatPeps.lib.pin
-rw-r--r-- 1   562 Dec  6 16:10 RepeatPeps.lib.pjs
-rw-r--r-- 1  212K Dec  6 16:10 RepeatPeps.lib.pot
-rw-r--r-- 1   16M Dec  6 16:10 RepeatPeps.lib.psq
-rw-r--r-- 1   16K Dec  6 16:10 RepeatPeps.lib.ptf
-rw-r--r-- 1   71K Dec  6 16:10 RepeatPeps.lib.pto
-rw-r--r-- 1  5.5K Dec  6 01:02 RepeatPeps.readme
-rw-r--r-- 1  189M Dec  6 16:05 RMRB.embl
-rw-r--r-- 1   18M Dec  6 01:02 RMRBMeta.embl
-rw-r--r-- 1  175M Oct 27  2018 RMRBSeqs.embl
-rw-r--r-- 1   29K Dec  6 01:02 RMRB_spec_to_tax.json
-rw-r--r-- 1  109M Dec  6 01:02 taxonomy.dat

Here is the installation output:

Add a Search Engine:
   1. Crossmatch: [ Un-configured ]
   2. RMBlast: [ Configured, Default ]
   3. HMMER3.1 & DFAM: [ Un-configured ]
   4. ABBlast: [ Un-configured ]

   5. Done

Enter Selection: 5
Building FASTA version of RepeatMasker.lib ................................................................
Building RMBlast frozen libraries..
The program is installed with a the following repeat libraries:

FamDB Directory     : /data/biosoft/RepeatMasker/Libraries/famdb
FamDB Generator     : famdb.py v1.0
FamDB Format Version: 1.0
FamDB Creation Date : 2023-11-15 11:30:15.311827

Database: Dfam withRBRM
Version : 3.8
Date    : 2023-11-14

Dfam - A database of transposable element (TE) sequence alignments and HMMs.
RBRM - RepBase RepeatMasker Edition - version 20181026

2 Partitions Present
Total consensus sequences present: 498359
Total HMMs present               : 472219

Partition Details
-----------------
 Partition 0 [dfam38_full.0.h5]: root - Mammalia, Amoebozoa, Bacteria <bacteria>, Choanoflagellata, Rhodophyta, Haptista, Metamonada, Fungi, Sar, Placozoa, Ctenophora <comb jellies>, Filasterea, Spiralia, Discoba, Cnidaria, Porifera, Viruses
     Consensi: 308177, HMMs: 295552

 Partition 1 [ Absent ]: Obtectomera 

 Partition 2 [ Absent ]: Euteleosteomorpha 

 Partition 3 [ Absent ]: Sarcopterygii - Sauropsida, Coelacanthimorpha, Amphibia, Dipnomorpha
 Partition 4 [ Absent ]: Diptera 

 Partition 5 [dfam38_full.5.h5]: Viridiplantae 
     Consensi: 190182, HMMs: 176667

 Partition 6 [ Absent ]: Deuterostomia - Chondrichthyes, Hemichordata, Cladistia, Holostei, Tunicata, Cephalochordata, Cyclostomata <vertebrates>, Osteoglossocephala, Otomorpha, Elopocephalai, Echinodermata, Chondrostei

 Partition 7 [ Absent ]: Hymenoptera 

 Partition 8 [ Absent ]: Ecdysozoa - Nematoda, Gelechioidea, Yponomeutoidea, Incurvarioidea, Chelicerata, Collembola, Polyneoptera, Tineoidea, Apoditrysia, Monocondylia, Strepsiptera, Palaeoptera, Neuropterida, Crustacea, Coleoptera, Siphonaptera, Trichoptera, Paraneoptera, Myriapoda, Scalidophora

Further documentation on the program may be found here:
  /data/biosoft/RepeatMasker/repeatmasker.help

Thank you very much

rmhubley commented 7 months ago

The "RepeatMaskerLib.h5" file has been replaced by the individual HDF5 partitions now located in Libraries/famdb/*.h5. I suspect we missed updating the example documentation somewhere, as the famdb option "-i" is now used to point the tool to the directory containing the partitions. E.g.:

% python famdb.py -i Libraries/famdb families -f embl -a -d Cucurbitaceae > Cucurbitaceae.embl

or simply (using default locations):

% ./famdb.py families -f embl -a -d Cucurbitaceae > Cucurbitaceae.embl
% egrep -c "^ID" Cucurbitaceae.embl
4836
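If egrep is not at hand, the same count can be reproduced in Python. This is a minimal sketch that mirrors the `^ID` pattern used above (one line starting with "ID" per record in the exported file); the function name is hypothetical.

```python
def count_embl_records(lines):
    """Count records by counting lines that start with 'ID',
    mirroring `egrep -c "^ID"`."""
    return sum(1 for line in lines if line.startswith("ID"))


# Example usage on the file written by famdb.py above:
# with open("Cucurbitaceae.embl") as handle:
#     print(count_embl_records(handle))
```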

NOTE: The Dfam 3.8 famdb partitions now contain both curated/uncurated families. If you are used to working with only the curated Dfam families, you may want to add the "--curated" flag to famdb to only retrieve those:

% ./famdb.py families -f embl --curated -a -d Cucurbitaceae > Cucurbitaceae.embl
% egrep -c "^ID" Cucurbitaceae.embl
0

which sadly reports that there are no curated TE families available for this clade. RepeatMasker will still (by default) use only curated families from famdb to perform a search. However, we have introduced a new RepeatMasker flag ( "-uncurated" ) to support the new combined famdb files and to use both the curated and uncurated families in a search when requested.

To see where all the uncurated families are coming from, you may want to check out the lineage command:

% ./famdb.py lineage -a -d Cucurbitaceae
1 root(0) [0]
└─131567 cellular organisms(0) [0]
  └─2759 Eukaryota(0) [0]
    └─33090 Viridiplantae(5) [0]
      └─35493 Streptophyta(5) [0]
        └─131221 Streptophytina(5) [0]
          └─3193 Embryophyta(5) [0]
            └─58023 Tracheophyta(5) [0]
              └─78536 Euphyllophyta(5) [0]
                └─58024 Spermatophyta(5) [0]
                  └─3398 Magnoliopsida(5) [0]
                    └─1437183 Mesangiospermae(5) [0]
                      └─71240 eudicotyledons(5) [0]
                        └─91827 Gunneridae(5) [0]
                          └─1437201 Pentapetalae(5) [0]
                            └─71275 rosids(5) [0]
                              └─91835 fabids(5) [0]
                                └─71239 Cucurbitales(5) [0]
                                  └─3650 Cucurbitaceae(5) [0]
                                    ├─1003877 Benincaseae(5) [0]
                                    │ ├─3653 Citrullus(5) [0]
                                    │ │ └─3654 Citrullus lanatus(5) [1804]
                                    │ └─3655 Cucumis(5) [0]
                                    │   ├─3656 Cucumis melo(5) [1970]
                                    │   └─3659 Cucumis sativus(5) [1062]
                                    └─1003878 Cucurbiteae(5) [0]
                                      └─3660 Cucurbita(5) [0]
                                        ├─3661 Cucurbita maxima(5) [0]
                                        └─3663 Cucurbita pepo(5) [0]

It looks like these came from de-novo runs on Citrullus lanatus (1,804 families), Cucumis melo (1,970 families), and Cucumis sativus (1,062 families). The value in parentheses following each species/clade denotes the famdb partition containing these families, and the value in square brackets is the number of families at that node.
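To tally the per-species counts without reading the tree by eye, the lineage text can be parsed with a short script. This is a sketch only: it assumes each node is printed as `<taxon id> <name>(<partition>) [<count>]`, as in the output above, and the function and variable names are hypothetical rather than part of famdb.py.

```python
import re

# One node per line, after optional tree-drawing characters:
# "<taxon id> <name>(<partition>) [<family count>]"
NODE_RE = re.compile(r"(\d+) (.+)\((\d+)\) \[(\d+)\]\s*$")


def parse_lineage(text):
    """Return (tax_id, name, partition, family_count) for each node line."""
    nodes = []
    for line in text.splitlines():
        match = NODE_RE.search(line)
        if match:
            tax_id, name, partition, count = match.groups()
            nodes.append((int(tax_id), name, int(partition), int(count)))
    return nodes


# Excerpt of the lineage output shown above.
SAMPLE = """\
3650 Cucurbitaceae(5) [0]
├─1003877 Benincaseae(5) [0]
│ ├─3653 Citrullus(5) [0]
│ │ └─3654 Citrullus lanatus(5) [1804]
│ └─3655 Cucumis(5) [0]
│   ├─3656 Cucumis melo(5) [1970]
│   └─3659 Cucumis sativus(5) [1062]
"""

if __name__ == "__main__":
    nodes = parse_lineage(SAMPLE)
    for tax_id, name, partition, count in nodes:
        if count:
            print(f"{name}: {count} families in partition {partition}")
    print("total:", sum(count for _, _, _, count in nodes))
```

For the excerpt above the total is 1804 + 1970 + 1062 = 4836, which agrees with the number of ID lines reported by egrep for the uncurated export.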

cyycyj commented 6 months ago

Dear Robert,

Thank you for your kind assistance! I will give it a try, and if I run into any bugs or other developments, I will keep you updated.