hoffmangroup / genomedata

The Genomedata format for storing large-scale functional genomics data.
https://genomedata.hoffmanlab.org/
GNU General Public License v2.0

genomedata-load joins fasta contig name with rest of description line #54

Open EricR86 opened 4 years ago

EricR86 commented 4 years ago

Original report (archived issue) by Anonymous.


Hello,

I was recently trying to create a genomedata archive from the GRCh38 FASTA file available here: https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/

The sequence description lines for this FASTA look like this:

    >chr1 AC:CM000663.2 gi:568336023 LN:248956422 rl:Chromosome M5:6aef897c3d6ff0c78aff06ac189178dd AS:GRCh38

When I ran genomedata-load on this file with the -s parameter and a signal file, the generated contig names contained the entire description line, with each space replaced by an underscore, instead of just the chromosome name. As a result, I get runtime warnings like this one:

    /anaconda3/envs/a/lib/python2.7/site-packages/tables/path.py:157: NaturalNameWarning: object name is not a valid Python identifier: 'chr1_AC:CM000663.2_gi:568336023_LN:248956422_rl:Chromosome_M5:6aef897c3d6ff0c78aff06ac189178dd_AS:GRCh38'; it does not match the pattern ^[a-zA-Z_][a-zA-Z0-9_]*$; you will not be able to use natural naming to access this object; using getattr()
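To illustrate why the warning fires, here is a minimal sketch that checks names against the pattern quoted in the warning message. The helper name `is_natural_name` is hypothetical and is not part of PyTables or Genomedata; only the regular expression comes from the warning text.

```python
import re

# Pattern copied from the NaturalNameWarning message in the report.
NATURAL_NAME = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*$")


def is_natural_name(name):
    """Return True if `name` matches the pattern quoted in the warning.

    Hypothetical helper for illustration only.
    """
    return NATURAL_NAME.match(name) is not None


# Contig name produced from the full description line (from the report).
bad_name = (
    "chr1_AC:CM000663.2_gi:568336023_LN:248956422_rl:Chromosome_"
    "M5:6aef897c3d6ff0c78aff06ac189178dd_AS:GRCh38"
)

print(is_natural_name(bad_name))  # False: the name contains ':' and '.'
print(is_natural_name("chr1"))    # True
```

The colons and periods carried over from the description line are what make the name fail the check; a plain chromosome name such as chr1 passes.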

When I look at the resulting genomedata with genomedata-info contigs, I see contigs like this, corroborating the above:

    chr10_AC:CM000672.2_gi:568336014_LN:133797422_rl:Chromosome_M5:c0eeee7acfdaf31b770a509bdaa6e51a_AS:GRCh38 0 41593521
    chr10_AC:CM000672.2_gi:568336014_LN:133797422_rl:Chromosome_M5:c0eeee7acfdaf31b770a509bdaa6e51a_AS:GRCh38 41693521 41916265
    chr10_AC:CM000672.2_gi:568336014_LN:133797422_rl:Chromosome_M5:c0eeee7acfdaf31b770a509bdaa6e51a_AS:GRCh38 42066265 133797422
    chr11_AC:CM000673.2_gi:568336013_LN:135086622_rl:Chromosome_M5:1511375dc2dd1b633af8cf439ae90cec_AS:GRCh38 0 50821348
    chr11_AC:CM000673.2_gi:568336013_LN:135086622_rl:Chromosome_M5:1511375dc2dd1b633af8cf439ae90cec_AS:GRCh38 51078348 54425074

After I edited the FASTA to omit the extra metadata in the description lines, the genomedata had the expected contigs. Since then, I also found another workaround: creating the genomedata from a chrom.sizes file instead of a FASTA. Still, this issue seemed worth reporting.
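For anyone who wants to script the first workaround (editing the FASTA description lines), here is a rough sketch. It is not part of Genomedata; the function names `strip_descriptions` and `clean_fasta` are hypothetical, and it assumes an uncompressed FASTA in which the contig name is the first whitespace-delimited word after `>`.

```python
def strip_descriptions(lines):
    """Yield FASTA lines with each description line cut at the first whitespace.

    Sequence lines pass through unchanged.
    """
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split(None, 1)
            name = fields[0] if fields else ""
            yield ">" + name + "\n"
        else:
            yield line


def clean_fasta(src_path, dst_path):
    """Write a copy of the FASTA at src_path with shortened description lines.

    Both paths are supplied by the caller; nothing here is Genomedata-specific.
    """
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(strip_descriptions(src))


# Small in-memory example using a description line in the style of the report.
example = [
    ">chr1 AC:CM000663.2 gi:568336023 LN:248956422\n",
    "NNNNACGT\n",
]
cleaned = list(strip_descriptions(example))
print(cleaned)
```

The cleaned file can then be passed to genomedata-load in place of the original FASTA.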

Thanks,

Paul

EricR86 commented 4 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


Thanks for the report! If you’re interested only in the assembly layout and not the sequence, we also recommend using AGP files if possible.

Here is the link to the latest assembly on NCBI for hg38. Admittedly it can be a bit difficult to find these files.

EricR86 commented 4 years ago

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


This is going to be triaged as a feature request. It’s likely that the whitespace in the description line is being parsed incorrectly. There are two thoughts on how to address this:

  1. Add an option so that the chromosome name comes only from the first word (up to the first whitespace)
  2. Possibly make this option the default in the next major release of Genomedata
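
The proposed option could look roughly like the sketch below. This is not Genomedata's actual code: the function `contig_name` and its `first_word_only` parameter are hypothetical. The fallback branch (joining the whitespace-separated fields with underscores) is an assumption inferred from the contig names in the report, not taken from the source.

```python
def contig_name(description_line, first_word_only=True):
    """Derive a contig name from a FASTA description line.

    Hypothetical sketch of the proposed option, not Genomedata's implementation.
    """
    text = description_line.lstrip(">").strip()
    if first_word_only:
        # Proposed behavior: keep only the text up to the first whitespace.
        return text.split(None, 1)[0] if text else ""
    # Assumed current behavior, inferred from the reported names:
    # every run of whitespace becomes a single underscore.
    return "_".join(text.split())


header = (
    ">chr1 AC:CM000663.2 gi:568336023 LN:248956422 rl:Chromosome "
    "M5:6aef897c3d6ff0c78aff06ac189178dd AS:GRCh38"
)

print(contig_name(header))                         # chr1
print(contig_name(header, first_word_only=False))  # full underscore-joined name
```

Defaulting `first_word_only` to True here mirrors the second point; keeping it switchable would let existing archives that rely on the long names be rebuilt the same way.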