lh3 / miniprot

Align proteins to genomes with splicing and frameshift
https://lh3.github.io/miniprot/
MIT License

Is it necessary to mask repetitive sequences? #49

Closed XXH123a closed 11 months ago

XXH123a commented 11 months ago

Hi Professor Li,

  1. Should there be no difference in miniprot results between a soft-masked and an unmasked genome?
  2. If hard masking is used, some CDS sequences extracted from the GFF file contain about 10 to 100 Ns. How should I deal with these N-containing CDS sequences?
  3. How should I filter the extracted CDS sequences, whether or not the repetitive sequences are masked? For example, should I discard every protein shorter than 50 amino acids, or use some other criterion? The following is a CDS sequence extracted from the GFF file that miniprot produced when annotating with proteins of the same species. The lengths of the proteins translated by seqkit and gffread also confuse me a little. [attached images]

lh3 commented 11 months ago

Repeat masking will affect the alignment of some proteins, as you showed. I don't know whether the overall effect is positive or negative. You will have to investigate that yourself.
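
As a possible starting point for that kind of investigation, here is a minimal Python sketch that reads extracted CDS records in FASTA format and reports, for each record, how many N bases it contains and how many complete codons it spans. It is not part of miniprot, seqkit, or gffread: the function names `read_fasta` and `summarize_cds` are hypothetical, and the thresholds (50 amino acids, taken from the question, and a maximum N fraction) are assumptions to tune, not recommended values.

```python
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            # Use the first whitespace-delimited token of the header as the name.
            name = parts[0] if parts else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def summarize_cds(name: str, seq: str, min_aa: int = 50,
                  max_n_frac: float = 0.0) -> Dict[str, object]:
    """Summarize one CDS record.

    min_aa and max_n_frac are illustrative thresholds (assumptions).
    `codons` counts complete codons, so it includes a terminal stop
    codon if the CDS carries one.
    """
    seq = seq.upper()
    n_count = seq.count("N")
    n_frac = n_count / len(seq) if seq else 0.0
    codons = len(seq) // 3
    return {
        "name": name,
        "length": len(seq),
        "n_count": n_count,
        "n_frac": n_frac,
        "codons": codons,
        "in_frame": len(seq) % 3 == 0,
        "keep": codons >= min_aa and n_frac <= max_n_frac,
    }


if __name__ == "__main__":
    import sys

    # Usage: python cds_summary.py < cds.fa
    for rec_name, rec_seq in read_fasta(sys.stdin):
        s = summarize_cds(rec_name, rec_seq)
        print(s["name"], s["length"], s["n_count"], s["codons"],
              s["in_frame"], s["keep"], sep="\t")
```

Running the same summary on CDS sets extracted from masked and unmasked runs would show how many records each filter removes in each case, which is one way to judge the effect of masking on your own data.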