makovalab-psu / AmpliCoNE-tool

AmpliCoNE: Ampliconic Copy Number Estimator

RepeatMasker for other genomes/species #1

Closed rsharris closed 2 years ago

rsharris commented 2 years ago

I'm trying to perform the steps listed under "AmpliCoNE usage with other reference genomes / species". Under step 1, one of the files needed is BED format output from RepeatMasker.

I'm having some trouble installing and running RepeatMasker (that's a different issue). The only description I've found for its output is under "Output / return format" on https://www.repeatmasker.org/webrepeatmaskerhelp.html . (I guess this describes the web-based RepeatMasker server, but the output is probably similar to what I would get running it on my own machine.) This says, in part, that "a table annotating the masked sequences as well as a table summarizing the repeat content of the query sequence will be" [produced]. Is one of those the BED file needed for AmpliCoNE? If so, which?

rsharris commented 2 years ago

What I got from running the web server on a short sequence includes a .out file containing this (this is just the first few lines).

   SW   perc perc perc  query      position in query     matching   repeat             position in repeat
score   div. del. ins.  sequence   begin end    (left)   repeat     class/family     begin  end    (left)  ID

  904   19.4 11.8  0.8  utig4-342    677   904 (39196) + L1MB8      LINE/L1            5926   6178    (0)   1  
  242   26.9  8.8  4.2  utig4-342   1296  1431 (38669) + MIRb       SINE/MIR            121    262    (6)   2  
 1915   20.1  8.2  2.4  utig4-342   1922  2399 (37701) + MLT1D      LTR/ERVL-MaLR         1    505    (0)   3  
  469   30.1  5.1  0.5  utig4-342   2622  2818 (37282) C LTR33      LTR/ERVL          (292)    223     18   4  
   36    0.0  0.0  0.0  utig4-342   4043  4081 (36019) + (A)n       Simple_repeat         1     39    (0)   5  
  672   26.1  7.4  2.8  utig4-342   4251  4494 (35606) + MIR        SINE/MIR              8    262    (0)   6  
  633   23.4  3.1  0.6  utig4-342   6874  7032 (33068) + MER5A      DNA/hAT-Charlie       1    163   (26)   7  
  533   28.1 15.9  0.4  utig4-342   8939  9170 (30930) + LTR67B     LTR/ERVL            344    611    (9)   8  
 1847   16.0  1.6  0.0  utig4-342  10159 10465 (29635) + AluSx      SINE/Alu              1    312    (0)   9  

There were other output files but this is the only one that makes any sense as being convertible to BED format. I guess I would need to grab columns 5, 6, and 7 as the BED interval (subtracting one from column 6, since BED starts are zero-based). Maybe column 10 or 11 as BED column 4. Unclear what else I might need.
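
For illustration, the conversion guessed at above could be sketched in Python roughly as follows. This is a minimal sketch, not AmpliCoNE's own code: the function name `rm_out_to_bed` is hypothetical, and joining columns 10 and 11 into the BED name field is just one possible choice. It assumes the whitespace-delimited .out layout shown above, with header lines that do not start with a numeric score.

```python
def rm_out_to_bed(lines):
    """Yield (chrom, start, end, name) tuples from RepeatMasker .out lines.

    Hypothetical helper: assumes the column layout of the .out excerpt above.
    """
    for line in lines:
        fields = line.split()
        # Skip blank lines and the two header lines, whose first field
        # ("SW", "score") is not a numeric alignment score.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        chrom = fields[4]               # column 5: query sequence
        start = int(fields[5]) - 1      # column 6: 1-based begin -> 0-based BED start
        end = int(fields[6])            # column 7: end (BED end is exclusive, so unchanged)
        name = fields[9] + "|" + fields[10]  # columns 10 and 11: repeat and class/family
        yield chrom, start, end, name


sample = """\
   SW   perc perc perc  query      position in query     matching   repeat             position in repeat
score   div. del. ins.  sequence   begin end    (left)   repeat     class/family     begin  end    (left)  ID

  904   19.4 11.8  0.8  utig4-342    677   904 (39196) + L1MB8      LINE/L1            5926   6178    (0)   1
  469   30.1  5.1  0.5  utig4-342   2622  2818 (37282) C LTR33      LTR/ERVL          (292)    223     18   4
"""

bed_rows = list(rm_out_to_bed(sample.splitlines()))
for row in bed_rows:
    print("\t".join(str(value) for value in row))
```

Splitting on whitespace rather than fixed column positions keeps the sketch tolerant of the variable padding in the .out file.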

I also tried looking for a RepeatMasker BED file in the UCSC data page for hg38, but no luck. That would have shown me what's needed, but I didn't find one there. (I did find a TRF BED file there, which resolved similar questions I would have had for the TRF step.)

rahulsimham commented 2 years ago

Sorry for the trouble. I had the description wrong and have fixed it now. The file you need is the .out file from the RepeatMasker output, not a BED file. The amplicone-build step will automatically grab columns 5, 6, and 7 from the .out file to identify the chromosome locations of repeat regions. Thanks for reporting the issue.

rsharris commented 2 years ago

Thumbs up! Thanks!