human-pangenomics / hpp_pangenome_resources

Seqnames in GRCh38 Graph (minigraph-cactus) to match gene annotation #28

Open pclavell opened 2 months ago

pclavell commented 2 months ago

Hello, I am running vg autoindex to splice the minigraph-cactus full pangenome according to GENCODE v44 gene annotations in order to map RNA-seq reads. I have two questions:

1) Running the following command produces the error shown below:

    vg autoindex \
      --workflow mpmap \
      --prefix data/00_autoindex/splicedpangenome \
      --gfa /gpfs/projects/bsc83/Data/assemblies/pangenome/minigraph_cactus/hprc-v1.1-mc-grch38.full.gfa \
      --tx-gff /gpfs/projects/bsc83/Data/gene_annotations/gencode/v44/modified/gencode.v44.chr_patch_hapl_scaff.annotation_chr2GRCh38#chr.gtf \
      --tmp-dir temporary \
      --threads 112 \
      --verbosity 2

Error:

    Saving GBWT and GBWTGraph to temporary/vg-ikdYP8/dir-MgGI5j/d0cc1cf507d88bdebe898d1ba90127a241a83700.gbz
    [IndexRegistry]: Adding splice junctions to GBZ-format graph.
    ERROR: Chromosome path "chr1" not found in graph or haplotypes index (line 6).

When I first saw this, I assumed it was the usual mismatch in chromosome naming (chr1 versus 1), so I looked in the minigraph-cactus reference and found SN:Z:GRCh38#chr1. I then changed the seqnames in the gene annotation from chr1 to GRCh38#chr1, but I still get the same error. Which seqnames does this pangenome reference use?

2) Since the GENCODE v44 annotation is built on GRCh38.p14, I am wondering whether it is compatible with the minigraph-cactus pangenome references you built.

Thanks

glennhickey commented 2 months ago

attn: @jeizenga

jeizenga commented 2 months ago

Are you able to share the GTF that you were using? Even the first few hundred lines would probably be sufficient.

pclavell commented 2 months ago

You can download it from this link (obtained from the gencode webpage): https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.chr_patch_hapl_scaff.annotation.gtf.gz

ldammer commented 1 month ago

Hello,

I was wondering if you found a solution to this issue. I'm getting the same error, and I have tried multiple annotations: the GENCODE one mentioned here, as well as annotations from NCBI and UCSC:

https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.40/
https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/

In all cases, the run crashes with the same error mentioned above.

pclavell commented 1 month ago

Hello, no, I couldn't solve it. I am waiting for the developers' answer.

jeizenga commented 1 month ago

Hi, apologies for the delay. My union has been on strike and I'm only just returning to work. TL;DR: you can prepend GRCh38#0# to the contig names in the GTF using sed, and it should then run through.
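For readers who want a starting point, the prefixing step might be sketched roughly as follows. This is only an illustration, not a command taken from the vg documentation: the helper name `add_pansn_prefix` is hypothetical, and the sketch assumes a standard GTF in which the contig name is the first column and header lines start with `#`. It should be applied to the original GENCODE GTF (contigs named `chr1`, and so on), not to a copy that has already been renamed.

```shell
# Hypothetical helper: read a GTF on stdin and prepend the PanSN prefix
# "GRCh38#0#" to the first column (the contig name) of every record.
# Lines starting with '#' (GTF headers) and empty lines pass through unchanged.
add_pansn_prefix() {
  sed -E '/^(#|$)/!s/^/GRCh38#0#/'
}

# Tiny made-up GTF fragment, for demonstration only.
printf '##description: example header\nchr1\tHAVANA\tgene\t100\t200\t.\t+\t.\tgene_id "example";\n' \
  | add_pansn_prefix
```

On a compressed annotation, the same function could be used as `zcat annotation.gtf.gz | add_pansn_prefix > annotation.pansn.gtf`, where both file names are placeholders.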

The GFA you're pointing to stores the reference genome as a particular "sample" alongside other samples that have identifiers like HG0xxxx. The combination of sample+haplotype+contig is specified using the PanSN naming specification, which produces names that look something like this:

GRCh38#0#chr1

The first field is the sample identifier (GRCh38), the second is the haplotype (0, which is somewhat redundant for references that don't have a diplotype), and the third is the contig (chr1).
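To make the three-field structure concrete, here is a minimal shell sketch that splits a PanSN-style name on `#`. The function name `parse_pansn` is hypothetical and exists only for illustration; it does not reflect how vg itself parses names.

```shell
# Hypothetical helper: split a PanSN-style name into its '#'-separated fields.
# Anything after the second '#' is kept as part of the contig field.
parse_pansn() {
  IFS='#' read -r sample haplotype contig <<EOF
$1
EOF
  printf 'sample=%s haplotype=%s contig=%s\n' "$sample" "$haplotype" "$contig"
}

parse_pansn 'GRCh38#0#chr1'
# prints: sample=GRCh38 haplotype=0 contig=chr1
```

By the same split, a name such as GRCh38#chr1 has only two fields, so chr1 ends up in the haplotype position and the contig field is empty. This is consistent with the advice above to use the full GRCh38#0# prefix.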