jtlovell / GENESPACE

How are chromosomes labelled by parse_annotations? #151

Closed plhm closed 2 months ago

plhm commented 4 months ago

Hello there, jtlovell!

Once again, thank you for the software. I'm hoping you can help me understand how "parse_annotations" labels the chromosomes that go into the 'bed' file. My issue is simple: I'm downloading genomes from NCBI at different stages of annotation / maturity. Some of them, like the Anolis carolinensis genome, work fairly well. Others are a bit more clunky.

My understanding is that "parse_annotations" uses the information from the "faa" and "gff" files it downloads to make the "bed" and protein files that genespace uses to run its analyses. Is that correct? If so, how does genespace determine the number of a chromosome, given that chromosome names from NCBI are very often formatted as "NC#####.#"? I'm particularly surprised to see that genespace is even able to figure out which are the "X" and "Y" chromosomes in the Anolis carolinensis genome, although the fasta and gff files that it downloads from NCBI do not seem to have that information. Could you let me know how that is possible?

The reason I ask is that the naming of chromosomes for the other "less mature" species on NCBI is highly inconsistent, and I want to make the necessary corrections to their bed files for proper plotting. My idea is to use awk to match the protein name in the gff to the scaffold name in the original genome file, and from there get the chromosome name to edit the bed file. However, I'm hopeful that your insight into how genespace does its formatting could spare me this somewhat cumbersome route.

Thank you very much for your time.

Best,

Pietro

jtlovell commented 4 months ago

Good question ... it's true that some genomes on NCBI require some processing to get into shape. For the standard-formatted genomes, there is an entry (flagged "region" in the third field, with the attribute "chromosome") that provides the NC#--->informativeName dictionary. GENESPACE pulls these out and relabels.

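library(data.table)

# read only the "region" rows of the gff, keeping the "chromosome" attribute
# (path2gff is the path to the NCBI gff file)
chrIDs <- data.table(data.frame(rtracklayer::readGFF(
  filepath = path2gff,
  filter = list(type = "region"),
  tags = c("chromosome"))))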
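
For anyone who wants to check or hand-correct the labels, the same idea can be sketched in base R without rtracklayer. This is only an illustration and not GENESPACE's own code: the GFF lines, accessions and bed rows below are made up, and `build_chr_dict` and `relabel_bed` are hypothetical helper names. It assumes a 9-column GFF3 where "region" rows carry a `chromosome=` attribute, and it leaves any sequence without such a row under its original name.

```r
# Made-up GFF3 lines in the NCBI style (accessions and names are hypothetical)
gff_lines <- c(
  "##gff-version 3",
  "NC_000001.1\tRefSeq\tregion\t1\t1000\t.\t+\t.\tID=NC_000001.1:1..1000;chromosome=1;gbkey=Src",
  "NC_000001.1\tGnomon\tgene\t10\t500\t.\t+\t.\tID=gene-a;gbkey=Gene",
  "NC_000002.1\tRefSeq\tregion\t1\t800\t.\t+\t.\tID=NC_000002.1:1..800;chromosome=X;gbkey=Src"
)

# Build a named vector: sequence ID -> chromosome name, from "region" rows only
build_chr_dict <- function(lines) {
  lines <- lines[!startsWith(lines, "#")]          # drop header/comment lines
  fields <- strsplit(lines, "\t", fixed = TRUE)
  fields <- fields[lengths(fields) == 9]           # keep well-formed rows
  seqid <- vapply(fields, `[`, character(1), 1)
  type <- vapply(fields, `[`, character(1), 3)
  # prefix ";" so the key can be matched as a whole attribute name
  attrs <- paste0(";", vapply(fields, `[`, character(1), 9))
  keep <- type == "region" & grepl(";chromosome=", attrs, fixed = TRUE)
  chr <- sub(".*;chromosome=([^;]*).*", "\\1", attrs[keep])
  dict <- setNames(chr, seqid[keep])
  dict[!duplicated(names(dict))]                   # one label per sequence ID
}

# Replace sequence IDs in a bed-like table; IDs missing from the dictionary are kept
relabel_bed <- function(bed, dict) {
  new_names <- unname(dict[bed$chr])
  bed$chr <- ifelse(is.na(new_names), bed$chr, new_names)
  bed
}

# Made-up bed rows: two on labelled chromosomes, one on an unlabelled scaffold
bed <- data.frame(
  chr = c("NC_000001.1", "NC_000002.1", "NW_000009.1"),
  start = c(10L, 20L, 5L),
  end = c(500L, 700L, 90L),
  id = c("gene-a", "gene-b", "gene-c"),
  stringsAsFactors = FALSE
)

dict <- build_chr_dict(gff_lines)
bed_relabelled <- relabel_bed(bed, dict)
print(bed_relabelled)
```

For a genome whose gff lacks usable "region" rows, the dictionary could instead be written by hand as a named character vector and passed to the same relabelling step.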