lidaof / eg-react

WashU Epigenome Browser
https://epigenomegateway.wustl.edu
Other
66 stars 29 forks source link

Format RepeatMasker track as in CHM13V2? #345

Closed docmanny closed 5 months ago

docmanny commented 5 months ago

Hello! We've been setting up various genomes on a local instance, and we've hit a small snag. We wanted to have our RepeatMasker track load automatically and resemble the CHM13-V2 RepeatMasker track available by default. Using bigBedToBed we were able to see that it looks like this:

chr1    0   2872    (CTAACC)n   0   +   0   2872    0,0,128 2702    Simple_repeat   undefined   2.9
chr1    2876    4571    TAR1    0   -   2876    4571    0,0,128 5173    Satellite   subtelo 13.9
chr1    4251    4702    LTR60B  0   -   4251    4702    81,183,86   962 LTR ERV1    27.3
chr1    4702    4829    LTR60B  0   -   4702    4829    81,183,86   505 LTR ERV1    20.5
chr1    4832    5432    L1MC3   0   +   4832    5432    140,172,62  1304    LINE    L1  19.6
chr1    5443    5697    MER34C_v    0   +   5443    5697    81,183,86   1403    LTR ERV1    14.7
chr1    5697    5855    L1MC3   0   +   5697    5855    140,172,62  3544    LINE    L1  24.6
chr1    5855    6300    MSTA1   0   +   5855    6300    81,183,86   2442    LTR ERVL-MaLR   14.8
chr1    6300    7301    L1MC3   0   +   6300    7301    140,172,62  3544    LINE    L1  24.6
chr1    7310    7702    L1MC3   0   +   7310    7702    140,172,62  1215    LINE    L1  20.5

So, for our genome, we imitated the structure in our own RepeatMasker.bed file:

SUPER__1    0   16471   (CCCTAA)n   0   +   0   16471   134,134,172 18762   Simple_repeat   undefined   0.1
SUPER__1    16566   16606   (AAGGC)n    0   +   16566   16606   134,134,172 13  Simple_repeat   undefined   16.7
SUPER__1    19159   19621   Helitron10_Myo  0   +   19159   19621   255,102,0   1119    RC  Helitron    25.6
SUPER__1    19670   20299   Helitron10_Myo  0   +   19670   20299   255,102,0   1606    RC  Helitron    34.8
SUPER__1    20386   21241   Helitron10_Myo  0   +   20386   21241   255,102,0   1722    RC  Helitron    35.9
SUPER__1    22969   23014   (CTCC)n 0   +   22969   23014   134,134,172 22  Simple_repeat   undefined   27.1
SUPER__1    23014   23059   (CT)n   0   +   23014   23059   134,134,172 53  Simple_repeat   undefined   0.0
SUPER__1    23059   23088   (CTCC)n 0   +   23059   23088   134,134,172 22  Simple_repeat   undefined   27.1
SUPER__1    23088   23121   (AAACCTCC)n 0   +   23088   23121   134,134,172 35  Simple_repeat   undefined   0.0
SUPER__1    23121   23145   (CTCC)n 0   +   23121   23145   134,134,172 22  Simple_repeat   undefined   27.1

However, when we try to visualize this as the default RepeatMasker track in the genome after converting it to a bigBed, it fails to show any hits, even though loading the same file as a regular bigBed works:

image

I think the problem is probably the track definition. This is the current rmsk.as file we are using, based on our best guess and what we could find online about the original CHMV2 RepeatMasker track:

table rmskV2
"RepeatMasker bigBed structure a la CHM13"
    (
    string chrom;       "Genomic sequence name"
    uint chromStart;    "Start in genomic sequence"
    uint chromEnd;      "End in genomic sequence"
    string name;        "Name of repeat"
    uint score;         "always 0 place holder"
    char[1] strand;     "Relative orientation + or -"
    uint thickStart;    "Start of the repeat"
    uint thickEnd;      "End of the repeat"
    uint reserved;      "Used as itemRgb as of 2004-11-22"
    uint swScore;       "Smith Waterman alignment score"
    string repClass;    "Class of repeat"
    string repFamily;   "Family of repeat"
    float Div;            "Base mismatches"
    )

Would you be able to confirm if the autoSQL file is the problem, or if there's anything that needs to be done in the backend to visualize the RepeatMasker track as desired?

lidaof commented 5 months ago

Hi @docmanny , in your genome config file, for you repeatmasker.bigbed file, can you try both ` type: "rmskv2"andtype: "repeatmasker"` see if either works? otherwise I might need this file (first 100 lines) to trouble shoot more. Thank you!

docmanny commented 5 months ago

That was it! I needed to make two changes to file types:

It would be nice to add some detail to the "Add a new genome" section of the documentation on creating the bigBed file for RepeatMasker - for reference, these were my steps:

  1. Used rm2gff3 to convert my repeatmasker.out file to a GFF3
  2. Used a modified version of your format_gff.py script (format_rmgff.py.txt) to convert this into the BED9+4 format expected by rmskv2
  3. Use bedToBigBed -as rmskv2.as -type=bed9+4 {repeatmasker.bed} {genome.chrom} {repeatmasker.bb} with the rmskv2.as I pasted above
lidaof commented 5 months ago

Perfect! Thank you so much @docmanny. I will definitely update the documentation!