How are chromosomes labelled by parse_annotations?

Hello there, jtlovell!

Once again, thank you for the software. I'm hoping you can help me understand how "parse_annotations" labels the chromosomes that go into the 'bed' file. My issue is simple: I'm downloading genomes from NCBI at different stages of annotation and maturity. Some of them, like the Anolis carolinensis genome, work fairly well. Others are a bit more clunky.

My understanding is that "parse_annotations" uses the information from the "faa" and "gff" files it downloads to make the "bed" and protein files that genespace uses to run its analyses. Is that correct? If so, how does genespace determine the number of a chromosome, given that chromosome names from NCBI are very often formatted as "NC#####.#"? I'm particularly surprised to see that genespace is even able to figure out which are the "X" and "Y" chromosomes in the Anolis carolinensis genome, although the fasta and gff files that it downloads from NCBI do not seem to have that information. Could you let me know how that is possible?

The reason I ask is that the naming of chromosomes for the other "less mature" species on NCBI is highly inconsistent, and I want to make the necessary corrections to their bed files for proper plotting. My idea is to use awk to match the protein name in the gff to the scaffold name in the original genome file, and from there get the chromosome name to edit the bed file. However, I'm hopeful that your insight into how genespace does its formatting could save me from going this somewhat cumbersome route.
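
For concreteness, here is a rough sketch of the kind of relabelling I have in mind, written in Python rather than awk. It rests on assumptions I have not verified against genespace itself: that the NCBI gff contains "region" features whose attribute column carries a "chromosome=" tag, and that the bed file is tab-separated with the sequence name in the first column. The function names are my own and purely illustrative, and I am not claiming this is what "parse_annotations" does internally.

```python
from urllib.parse import unquote


def chromosome_names_from_gff(gff_lines):
    """Map sequence accessions (gff column 1) to chromosome labels.

    Assumption: chromosome-level records have a 'region' feature whose
    attributes include 'chromosome=<label>'.
    """
    names = {}
    for line in gff_lines:
        if line.startswith("#"):
            continue  # skip header and directive lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "region":
            continue
        # Column 9 is a ';'-separated list of key=value pairs.
        attrs = dict(
            pair.split("=", 1) for pair in fields[8].split(";") if "=" in pair
        )
        label = unquote(attrs.get("chromosome", ""))
        # Assumption: unplaced scaffolds may be tagged 'Unknown'; keep their
        # accession so they are not merged under a single name.
        if label and label != "Unknown":
            names.setdefault(fields[0], label)
    return names


def relabel_bed(bed_lines, names):
    """Replace column 1 of each bed line when a chromosome label is known."""
    relabelled = []
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        fields[0] = names.get(fields[0], fields[0])
        relabelled.append("\t".join(fields))
    return relabelled


# Made-up records, for illustration only.
gff = [
    "##gff-version 3",
    "NC_000001.1\tRefSeq\tregion\t1\t5000\t.\t+\t.\tID=r1;Name=X;chromosome=X;gbkey=Src",
    "NC_000001.1\tGnomon\tgene\t100\t900\t.\t+\t.\tID=gene-a;gbkey=Gene",
    "NW_000002.1\tRefSeq\tregion\t1\t800\t.\t+\t.\tID=r2;chromosome=Unknown;gbkey=Src",
]
bed = [
    "NC_000001.1\t100\t900\tXP_000001.1",
    "NW_000002.1\t10\t300\tXP_000002.1",
]

names = chromosome_names_from_gff(gff)
new_bed = relabel_bed(bed, names)
print("\n".join(new_bed))
```

Sequences without a usable chromosome label keep their original accession, so nothing is silently merged or dropped.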

Thank you very much for your time.

Best,

Pietro

How are chromosomes labelled by parse_annotations? #151