warelab / NAM_annotation

This repo contains the full annotation workflow used to annotate the NAM genomes

NAM genome repeatMask #1

Open weix-cshl opened 3 years ago

weix-cshl commented 3 years ago

Run the EG pipeline DNAFeatures_conf to repeatMask the NAM genomes and load the repeat features into their core databases.

In 2019, we ran the DNAFeatures pipeline to load repeat features into the core dbs for 10 NAM lines. At the time we used the Wessler-Bennetzen library for customized repeats (/mnt/grid/ware/hpc/home/data/weix/repeatMask_pipelines/Libraries/Maize/wessler-bennetzen-2015/TE_12-Feb-2015_15-35.fa). The databases were backed up at brie:/scratch/weix/NAMcoreDBs/2019RM/

Now we have a better, filtered repeat library, MTEC (/mnt/grid/ware/hpc/home/data/weix/repeatMask_pipelines/Libraries/Maize/shujun-filtered/maizeTE10102014.RMname.nogene), generated by Shujun, which gave more consistent coverage across NAM lines.

I did this in two steps.

  1. Run the dna_feature pipeline over the 18 NAM lines that had not been repeatMasked. The pipeline performs the following analyses:

    • Dust
    • TRF
    • RepeatMasker with the Repbase species zea_mays
    • repeatMasking with a customized library: the NAM TE library MTEC
  2. Redo the customized repeatMasking for the other 9 NAM lines that were repeatMasked in 2019

      To avoid repeating the same analysis, I created a new config file, Sharon/modules/Bio/EnsEMBL/EGPipeline/PipeConfig/DNAFeaturesCustom_conf.pm, that performs only the customized repeat library analysis. We can reuse the old results for the other 3 analyses and replace only the customized repeatMasker results.

      init_pipeline.pl Bio::EnsEMBL::EGPipeline::PipeConfig::DNAFeaturesCustom_conf \
        --host bhsqldw1 --port 3306 --user plensembl --pass AudreyII \
        -registry /grid/ware/data/data/weix/data/NAM/registry/$species.reg \
        -pipeline_dir $PIPELINE_DIR \
        -species zea_mays \
        -repeatmasker_library all=/mnt/grid/ware/hpc/home/data/weix/repeatMask_pipelines/Libraries/Maize/shujun-filtered/maizeTE10102014.RMname.nogene \
        -always_use_repbase 0 -no_dust 1 -no_trf 1 -hive_force_init 1
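The command above takes `$species` and `$PIPELINE_DIR` from the shell, so it has to be repeated once per NAM line. A minimal Python sketch of generating those commands follows. It only prints them (a dry run) and uses only the flags shown in the command above. The function name `build_init_command`, the example line names, the `/tmp/rm_custom` directory, and the `PASSWORD` placeholder are all hypothetical.

```python
import shlex

# Values copied from the init_pipeline.pl command in this comment.
PIPE_CONFIG = "Bio::EnsEMBL::EGPipeline::PipeConfig::DNAFeaturesCustom_conf"
REGISTRY_DIR = "/grid/ware/data/data/weix/data/NAM/registry"
LIBRARY = (
    "/mnt/grid/ware/hpc/home/data/weix/repeatMask_pipelines/Libraries/"
    "Maize/shujun-filtered/maizeTE10102014.RMname.nogene"
)


def build_init_command(species, pipeline_dir, password):
    """Return the init_pipeline.pl argument list for one NAM line."""
    return [
        "init_pipeline.pl", PIPE_CONFIG,
        "--host", "bhsqldw1", "--port", "3306",
        "--user", "plensembl", "--pass", password,
        "-registry", f"{REGISTRY_DIR}/{species}.reg",
        "-pipeline_dir", pipeline_dir,
        "-species", "zea_mays",
        "-repeatmasker_library", f"all={LIBRARY}",
        # Skip the analyses whose 2019 results are being reused.
        "-always_use_repbase", "0",
        "-no_dust", "1",
        "-no_trf", "1",
        "-hive_force_init", "1",
    ]


if __name__ == "__main__":
    # Hypothetical registry names; substitute the lines being redone.
    for species in ["example_line_a", "example_line_b"]:
        cmd = build_init_command(species, f"/tmp/rm_custom/{species}", "PASSWORD")
        # Dry run: print a shell-quoted command instead of executing it.
        print(shlex.join(cmd))
```

Building the command as a list and quoting it with `shlex.join` keeps paths intact if they ever contain spaces. Each printed line can be reviewed before it is run.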

weix-cshl commented 3 years ago

Calculated the repeat coverage and recorded it here.
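This comment does not say how coverage was computed. One common definition is the fraction of genome bases covered by at least one repeat feature, with overlapping features merged so that no base is counted twice. The sketch below assumes that definition and 1-based inclusive coordinates. The function name `repeat_coverage` and the toy features are illustrative, not taken from the pipeline.

```python
from collections import defaultdict


def repeat_coverage(features, genome_length):
    """Fraction of the genome covered by at least one repeat feature.

    features: iterable of (seq_region, start, end), 1-based inclusive.
    genome_length: total number of bases across all sequence regions.
    """
    by_region = defaultdict(list)
    for region, start, end in features:
        by_region[region].append((start, end))

    covered = 0
    for intervals in by_region.values():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end + 1:
                # Overlapping or directly adjacent: extend the current block.
                cur_end = max(cur_end, end)
            else:
                covered += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        covered += cur_end - cur_start + 1
    return covered / genome_length


if __name__ == "__main__":
    # Toy data: chr1 features merge into 1..80 (80 bp); chr2 adds 20 bp.
    toy = [("chr1", 1, 50), ("chr1", 40, 80), ("chr2", 10, 29)]
    print(f"{repeat_coverage(toy, 200):.2%}")  # prints 50.00%
```

Merging before summing matters because features from Dust, TRF, RepeatMasker, and the custom library can overlap. Summing raw feature lengths would overstate coverage.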

Here is the coverage for the 18 NAM lines processed in the first step:

  | NAM line | Coverage |
  |----------|:---------|
  | zea_maysb73ab10 | 84.80% |
  | zea_maysb97 | 84.60% |
  | zea_mayscml103 | 84.36% |
  | zea_mayscml228 | 83.70% |
  | zea_mayscml277 | 84.26% |
  | zea_mayscml322 | 84.75% |
  | zea_mayscml52 | 83.91% |
  | zea_mayscml69 | 84.50% |
  | zea_mayshp301 | 84.25% |
  | zea_maysil14h | 84.28% |
  | zea_maysm37w | 84.61% |
  | zea_maysmo18w | 84.56% |
  | zea_maysnc358 | 84.63% |
  | zea_maysoh43 | 84.39% |
  | zea_maysoh7b | 84.10% |
  | zea_maysp39 | 84.22% |
  | zea_maystx303 | 84.42% |
  | zea_maystzi8 | 84.23% |
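To put a number on how consistent the coverage is across lines, the table values can be summarized directly. The sketch below copies the 18 percentages from the table. The `summarize` helper is a hypothetical name introduced here for illustration.

```python
from statistics import mean

# Coverage percentages copied from the table above.
COVERAGE = {
    "zea_maysb73ab10": 84.80, "zea_maysb97": 84.60, "zea_mayscml103": 84.36,
    "zea_mayscml228": 83.70, "zea_mayscml277": 84.26, "zea_mayscml322": 84.75,
    "zea_mayscml52": 83.91, "zea_mayscml69": 84.50, "zea_mayshp301": 84.25,
    "zea_maysil14h": 84.28, "zea_maysm37w": 84.61, "zea_maysmo18w": 84.56,
    "zea_maysnc358": 84.63, "zea_maysoh43": 84.39, "zea_maysoh7b": 84.10,
    "zea_maysp39": 84.22, "zea_maystx303": 84.42, "zea_maystzi8": 84.23,
}


def summarize(coverage):
    """Return count, extremes, mean, and spread of the coverage values."""
    low = min(coverage, key=coverage.get)
    high = max(coverage, key=coverage.get)
    return {
        "n": len(coverage),
        "min": (low, coverage[low]),
        "max": (high, coverage[high]),
        "mean": round(mean(coverage.values()), 3),
        "range": round(coverage[high] - coverage[low], 2),
    }


if __name__ == "__main__":
    for key, value in summarize(COVERAGE).items():
        print(f"{key}: {value}")
```

For these 18 lines the lowest value is 83.70% (zea_mayscml228) and the highest is 84.80% (zea_maysb73ab10), a spread of 1.10 percentage points.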