Open EricR86 opened 4 years ago
Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).
Thanks for the report! If you are interested only in the assembly layout and not the sequence, we also recommend using AGP files where possible.
Here is the link to the latest assembly on NCBI for hg38. Admittedly it can be a bit difficult to find these files.
Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).
This is going to be triaged as a feature request. It’s likely that the empty space is being parsed incorrectly. So there are two thoughts behind this:
Original report (archived issue) by Anonymous.
Hello,
I was recently trying to create a genomedata archive from the GRCh38 fasta here: https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/
The sequence description lines for this fasta look like this:
>chr1 AC:CM000663.2 gi:568336023 LN:248956422 rl:Chromosome M5:6aef897c3d6ff0c78aff06ac189178dd AS:GRCh38
When I ran genomedata-load using this file with the -s parameter and a signal file, the generated contigs seem to concatenate all of the items in the description line with a colon, and then join them to the chromosome name with an underscore. As such, I get runtime warnings like these:
/anaconda3/envs/a/lib/python2.7/site-packages/tables/path.py:157: NaturalNameWarning: object name is not a valid Python identifier: 'chr1_AC:CM000663.2_gi:568336023_LN:248956422_rl:Chromosome_M5:6aef897c3d6ff0c78aff06ac189178dd_AS:GRCh38'; it does not match the pattern ^[a-zA-Z_][a-zA-Z0-9_]*$; you will not be able to use natural naming to access this object; using getattr()
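For illustration, the name in the warning can be reproduced by replacing each space in the description line with an underscore. The sketch below assumes that substitution purely because it matches the reported output; it is not taken from the genomedata source. It then checks the result against the pattern quoted in the warning, alongside the bare chromosome name.

```python
import re

# Pattern quoted in the NaturalNameWarning above.
NATURAL_NAME = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*$")

# Description line from the report, without the leading ">".
header = ("chr1 AC:CM000663.2 gi:568336023 LN:248956422 rl:Chromosome "
          "M5:6aef897c3d6ff0c78aff06ac189178dd AS:GRCh38")

# Assumption: spaces become underscores. This reproduces the reported name,
# but it is a guess at the behaviour, not the actual genomedata code.
joined = header.replace(" ", "_")

# Only the first whitespace-delimited token, i.e. the chromosome name.
first_token = header.split()[0]

print(joined)
print(bool(NATURAL_NAME.match(joined)))       # contains ":" and ".", so no match
print(bool(NATURAL_NAME.match(first_token)))  # "chr1" matches
```

The colons and periods carried over from the metadata fields are what make the joined name fail the pattern, while the chromosome name on its own passes.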
When I look at the resulting genomedata with genomedata-info contigs, I see contigs like this, corroborating the above:
chr10_AC:CM000672.2_gi:568336014_LN:133797422_rl:Chromosome_M5:c0eeee7acfdaf31b770a509bdaa6e51a_AS:GRCh38 0 41593521
chr10_AC:CM000672.2_gi:568336014_LN:133797422_rl:Chromosome_M5:c0eeee7acfdaf31b770a509bdaa6e51a_AS:GRCh38 41693521 41916265
chr10_AC:CM000672.2_gi:568336014_LN:133797422_rl:Chromosome_M5:c0eeee7acfdaf31b770a509bdaa6e51a_AS:GRCh38 42066265 133797422
chr11_AC:CM000673.2_gi:568336013_LN:135086622_rl:Chromosome_M5:1511375dc2dd1b633af8cf439ae90cec_AS:GRCh38 0 50821348
chr11_AC:CM000673.2_gi:568336013_LN:135086622_rl:Chromosome_M5:1511375dc2dd1b633af8cf439ae90cec_AS:GRCh38 51078348 54425074
After I edited the fasta to omit the extra metadata in the description lines, the genomedata had the expected contigs. Since then, I also found out about another workaround that creates the genomedata using a chrom.sizes file instead of a fasta. However, this issue was definitely still worth reporting.
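The fasta edit described above could be scripted along these lines. This is a minimal sketch, not part of genomedata: the function name strip_fasta_descriptions is hypothetical, and it assumes the contig name is the first whitespace-delimited token after ">" on each description line.

```python
import io

def strip_fasta_descriptions(src, dst):
    """Copy FASTA text from src to dst, keeping only the sequence name
    (first whitespace-delimited token) on each description line.

    Hypothetical helper for illustration; not a genomedata function.
    """
    for line in src:
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            dst.write(">" + name + "\n")
        else:
            # Sequence lines are copied through unchanged.
            dst.write(line)

# Small in-memory example using the description line from the report.
example = io.StringIO(
    ">chr1 AC:CM000663.2 gi:568336023 LN:248956422 rl:Chromosome "
    "M5:6aef897c3d6ff0c78aff06ac189178dd AS:GRCh38\n"
    "NNNNACGT\n"
)
cleaned = io.StringIO()
strip_fasta_descriptions(example, cleaned)
print(cleaned.getvalue())
```

For a real file, src and dst would be open file handles for the original and cleaned fasta; the cleaned file is the one to pass to genomedata-load.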
Thanks,
Paul