vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

This error occurs because there is something wrong with the gff file. #4256

Closed 08li20 closed 2 months ago

08li20 commented 3 months ago

I ran the following command:

vg autoindex --threads 2 --workflow mpmap --prefix cattle --ref-fasta Cattle_ARS-UCD2.0_GCF_002263795.3_rename.fa --vcf cattle.variants.vcf --tx-gff cattle.trans.gff

It failed with this error:

error:[IndexRegistry] contig 1 RefSeq region 1 158534110 . + . ID=NC_037328.1:1..158534110;Dbxref=taxon:9913;Name=1;breed=Hereford;chromosome=1;gbkey=Src;genome=chromosome;isolate=L1;Dominette;01449;registration;number;42190680;mol_type=genomic;DNA;sex=female;tissue-type=left;lung from GTF/GFF cattle.trans.gff is not found in reference

jeizenga commented 3 months ago

It looks to me like the entire GFF line is being parsed as its contig name. Maybe the GFF you have is space-separated instead of tab-separated?
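
One way to check the delimiter guess is to count tab-separated columns on every feature line. The following is a minimal sketch, not part of vg: `check_gff_columns` is a hypothetical helper name, and it only assumes that GFF/GTF feature lines should have exactly 9 tab-separated columns.

```shell
# Hypothetical helper (not a vg command): report feature lines that do not
# have exactly 9 tab-separated columns.
# Usage: check_gff_columns annotation.gff
check_gff_columns() {
    awk -F'\t' '
        /^#/ || NF == 0 { next }          # skip comment and empty lines
        NF != 9 {
            printf "line %d: %d tab-separated fields\n", NR, NF
            bad++
        }
        END { exit (bad > 0) }            # non-zero exit status if any line is malformed
    ' "$1"
}
```

Running `check_gff_columns cattle.trans.gff | head` would list the first offending lines. A line that contains no tabs at all is reported as having 1 field, which would match the symptom in the error above.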

08li20 commented 3 months ago

After converting the delimiters in the GFF file to tabs, I still get an error saying that the contig cannot be found in the reference sequence:

vg autoindex --threads 2 --workflow mpmap --prefix cattle --ref-fasta Cattle_ARS-UCD2.0_GCF_002263795.3_rename.fa --vcf cattle.variants.vcf --tx-gff cattle.gff

error:[IndexRegistry] contig 1 from GTF/GFF cattle.gff is not found in reference

08li20 commented 3 months ago

Will the lack of comment lines in the gff file affect the matching?

jeizenga commented 3 months ago

No, it would not. My guess is that it's most likely a "1" vs "chr1" mismatch. If not that, then some other mismatched representation. You can get a quick look at the sequence names in the FASTA with grep ">" ref.fa, which will probably make the source of the error obvious.
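
To compare the two sets of names directly, something like the sketch below could be used. `gff_contigs_missing_from_fasta` is a hypothetical helper name, and the sketch assumes that a FASTA sequence name is the text after `>` up to the first whitespace, which is the usual indexing convention but is not confirmed here for vg.

```shell
# Hypothetical helper (not a vg command): print each contig name from
# column 1 of a GFF/GTF that is not a sequence name in the FASTA.
# Usage: gff_contigs_missing_from_fasta ref.fa annotation.gff
gff_contigs_missing_from_fasta() {
    awk -F'\t' '
        FNR == 1 { file++ }               # file 1 is the FASTA, file 2 the GFF
        file == 1 {
            if (/^>/) {
                name = substr($0, 2)      # drop the leading ">"
                sub(/[ \t].*/, "", name)  # assumed: name ends at first whitespace
                in_fasta[name] = 1
            }
            next
        }
        /^#/ || NF == 0 { next }          # skip GFF comment and empty lines
        !($1 in in_fasta) && !seen[$1]++ { print $1 }
    ' "$1" "$2"
}
```

An empty result means every contig named in the GFF exists in the FASTA. Any name that is printed, such as `1` when the FASTA uses `chr1`, is one that vg would be expected to reject.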

08li20 commented 3 months ago

I changed the chromosome names in the reference genome and in the GFF file to be the same, but a similar error still occurred:

vg autoindex --threads 2 --workflow mpmap --prefix cattle --ref-fasta Cattle_ARS-UCD2.0_GCF_002263795.3_rename.fa --vcf cattle.variants.vcf --tx-gff cattle.gff2

error:[IndexRegistry] contig >1 from GTF/GFF cattle.gff2 is not found in reference

jeizenga commented 3 months ago

The GFF now appears to have the > from the FASTA name line inserted into the contig name, so I think this was probably a move in the wrong direction. Can you copy the output of these commands? Then I can probably be more specific.

grep ">" Cattle_ARS-UCD2.0_GCF_002263795.3_rename.fa | head

and

head cattle.gff2

08li20 commented 3 months ago

After I modified the GFF file, I ran the following command:

vg autoindex --threads 5 --workflow mpmap --prefix cattle --ref-fasta Cattle_ARS-UCD2.0_GCF_002263795.3_rename.fa --vcf cattle.variants.vcf --tx-gff cattlecattle.gff5

It reported these errors:

ERROR: Tag "transcript_id" not found in attributes (line 145).
ERROR: Tag "transcript_id" not found in attributes (line 4).
ERROR: No transcripts parsed (remember to set feature type "-y" in vg rna or "-f" in vg autoindex)

How exactly do I change the -f parameter in vg autoindex?

jeizenga commented 3 months ago

Typically, a GFF file will include a unique identifier for each transcript as annotations in column 9. Often it's an accession number from a public database. Different genome annotation projects use different labels for the identifier, so you have to specify which one is the unique identifier using the --gff-tx-tag argument. The default transcript_id is what's used by GENCODE, but you'll have to figure out what the label is in your data set.
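
To see which labels a particular file actually uses, the attribute tags in column 9 can be tallied per feature type. This is a rough sketch rather than anything vg provides: `list_gff_tags` is a hypothetical helper name, and it assumes attributes are separated by `;` with keys ending at `=` (GFF3 style) or at a space (GTF style).

```shell
# Hypothetical helper (not a vg command): count how often each attribute
# tag appears in column 9, grouped by the feature type in column 3.
# Output columns: feature type, tag, count (tab-separated).
# Usage: list_gff_tags annotation.gff
list_gff_tags() {
    awk -F'\t' '
        /^#/ || NF < 9 { next }           # skip comments and malformed lines
        {
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++) {
                a = attrs[i]
                sub(/^[ \t]+/, "", a)     # trim leading whitespace
                if (a == "") continue
                sub(/[= ].*/, "", a)      # keep only the key
                count[$3 "\t" a]++
            }
        }
        END { for (k in count) print k "\t" count[k] }
    ' "$1" | LC_ALL=C sort
}
```

For example, `list_gff_tags cattlecattle.gff5` would show which tags are present on `exon` lines and which on `CDS` lines. A tag that appears on every line of the feature type passed to `--gff-feature`, and that identifies the transcript, is the candidate for `--gff-tx-tag`.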

08li20 commented 3 months ago

The command reported that transcript_id was not found on line 145 of the GFF file, but line 145 of my file is a CDS feature. I added the parameter --gff-feature exon so that only features labeled exon in the third column of the GFF file are used, but the error persists:

vg autoindex --threads 5 --gff-feature exon --gff-tx-tag transcript_id --workflow mpmap --prefix cattle --ref-fasta Cattle_ARS-UCD2.0_GCF_002263795.3_rename.fa --vcf cattle.variants.vcf --tx-gff cattlecattle.gff5

ERROR: Tag "transcript_id" not found in attributes (line 145).

jeizenga commented 3 months ago

Can you send the GFF file you're using?

08li20 commented 2 months ago

sam.gff.docx

This is the content of the first two hundred lines of the GFF file.

jeizenga commented 2 months ago

Closing here since you have opened the same thing as a separate issue at https://github.com/vgteam/vg/issues/4264