lh3 / minimap2

A versatile pairwise aligner for genomic and spliced nucleotide sequences
https://lh3.github.io/minimap2

name is not defined in paftools.js gff2bed #422

Open johnomics opened 5 years ago

johnomics commented 5 years ago

Thank you for all your excellent work on minimap2, we use it every day.

I'm trying to convert the NCBI GRCh38 RefSeq annotation to BED format for aligning with minimap2 using paftools.js gff2bed. As per your advice, I'm using the no_alt_analysis GRCh38, and have got the full_analysis_set GFF and GTF from the same folder:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
ftp://ftp.ncbi.nlm.nih.gov//genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_full_analysis_set.refseq_annotation.gff.gz
ftp://ftp.ncbi.nlm.nih.gov//genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_full_analysis_set.refseq_annotation.gtf.gz

I get the following error when running gff2bed, with the GTF or GFF (minimap2 v2.17 release):

$ paftools.js gff2bed -j GCA_000001405.15_GRCh38_full_analysis_set.refseq_annotation.gtf
/mnt/lustre/groups/biol-tf-2018/software/miniconda3/bin/paftools.js:1593: ReferenceError: name is not defined
            exons.push([t[0], t[3], t[4], t[6], id, type, name, tname]);
                                                 ^
ReferenceError: name is not defined
    at paf_gff2bed (/mnt/lustre/groups/biol-tf-2018/software/miniconda3/bin/paftools.js:1593:50)
    at main (/mnt/lustre/groups/biol-tf-2018/software/miniconda3/bin/paftools.js:2517:29)
    at /mnt/lustre/groups/biol-tf-2018/software/miniconda3/bin/paftools.js:2534:1

The name variable used at line 1593 is set in the if statements at lines 1567 and 1574, but it is not initialised; instead, a gname variable is initialised at line 1562 but does not appear to be used.
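
The failure mode can be reproduced in isolation. The sketch below is not code from paftools.js: `buildRowBuggy`, `buildRowFixed`, `errorNameOf` and the `label` identifier are hypothetical names chosen for illustration. It shows that reading an identifier that was never declared throws a ReferenceError, while declaring and initialising it before any branch avoids the problem.

```javascript
// Hypothetical reduction of the reported failure; not code from paftools.js.
function buildRowBuggy(fields) {
    var gname = "N/A"; // initialised but never used, like gname in the report
    // `label` is never declared anywhere, so reading it throws a ReferenceError.
    return [fields[0], fields[1], label];
}

function buildRowFixed(fields) {
    var label = "N/A"; // declared and initialised before any branch
    if (fields.length > 2) label = fields[2];
    return [fields[0], fields[1], label];
}

// Run fn and report the name of the error it throws, or null if it succeeds.
function errorNameOf(fn) {
    try {
        fn();
        return null;
    } catch (e) {
        return e.name;
    }
}

var buggy = errorNameOf(function () { return buildRowBuggy(["chr1", 100]); });
var fixed = buildRowFixed(["chr1", 100]);
console.log(buggy, fixed);
```

Initialising the variable to a default such as "N/A" up front means every later code path can read it safely, whichever `if` branch was taken.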

If I change the name variable to gname, the command works, but I only ever get N/A for gene names; the NCBI annotations have gene_id and gene, but not gene_name. However, changing gene_name to gene_id or gene, or adding additional else if statements to check for gene_id or gene, doesn't work either.
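
For illustration, a gene-name fallback of the kind described above could look like the following sketch. It is a standalone example under stated assumptions, not the fix that went into paftools.js: `parseGtfAttributes` and `pickGeneName` are hypothetical helpers, the attribute values are made up, and the fallback order (gene_name, then gene, then gene_id) is an assumption.

```javascript
// Hypothetical helpers for illustration; not functions from paftools.js.

// Parse the GTF attribute column (key "value"; pairs) into a plain map.
function parseGtfAttributes(column9) {
    var attrs = Object.create(null);
    var re = /(\S+)\s+"([^"]*)";?/g;
    var m;
    while ((m = re.exec(column9)) !== null) {
        if (!(m[1] in attrs)) attrs[m[1]] = m[2]; // keep the first occurrence
    }
    return attrs;
}

// Assumed fallback order: gene_name, then gene, then gene_id, else "N/A".
function pickGeneName(attrs) {
    var keys = ["gene_name", "gene", "gene_id"];
    for (var i = 0; i < keys.length; ++i) {
        if (attrs[keys[i]] !== undefined) return attrs[keys[i]];
    }
    return "N/A";
}

// Made-up attribute string in the style of a GTF without gene_name.
var sample = 'gene_id "GENE_A"; transcript_id "TX_1"; gene "GENE_A";';
var sampleAttrs = parseGtfAttributes(sample);
console.log(sampleAttrs.transcript_id, pickGeneName(sampleAttrs));
```

Parsing the whole attribute column into a map first keeps the fallback logic separate from the string matching, so adding another key only changes the `keys` list.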

Please could you look into this? Should I be using a different annotation? Or is there a fix that will include the NCBI gene names? Many thanks.

lh3 commented 5 years ago

Please try the latest paftools. It should have resolved the issue.

johnomics commented 5 years ago

Thanks for the quick response. This works for the GTF, so I can continue with that, but just to let you know, it doesn't work with the GFF (maybe a separate issue?):

$ paftools.js gff2bed -j GCA_000001405.15_GRCh38_full_analysis_set.refseq_annotation.gff
chr1    12227   12612   NR_046018.2|misc_RNA|N/A    1000    +
chr1    12721   13220   NR_046018.2|misc_RNA|N/A    1000    +
chr1    14829   14969   NR_024540.1|misc_RNA|N/A    1000    -
chr1    15038   15795   NR_024540.1|misc_RNA|N/A    1000    -
chr1    15947   16606   NR_024540.1|misc_RNA|N/A    1000    -
chr1    16765   16857   NR_024540.1|misc_RNA|N/A    1000    -
chr1    17055   17232   NR_024540.1|misc_RNA|N/A    1000    -
chr1    17368   17605   NR_024540.1|misc_RNA|N/A    1000    -
chr1    17742   17914   NR_024540.1|misc_RNA|N/A    1000    -
chr1    18061   18267   NR_024540.1|misc_RNA|N/A    1000    -
chr1    18366   24737   NR_024540.1|misc_RNA|N/A    1000    -
chr1    24891   29320   NR_024540.1|misc_RNA|N/A    1000    -
/mnt/lustre/groups/biol-tf-2018/software/miniconda3/bin/paftools.js:1578: Error: No transcript_id
        if (id == null) throw Error("No transcript_id");
                        ^
Error: No transcript_id
    at Error (<anonymous>)
    at paf_gff2bed (/mnt/lustre/groups/biol-tf-2018/software/miniconda3/bin/paftools.js:1578:25)
    at main (/mnt/lustre/groups/biol-tf-2018/software/miniconda3/bin/paftools.js:2518:29)
    at /mnt/lustre/groups/biol-tf-2018/software/miniconda3/bin/paftools.js:2535:1

lh3 commented 5 years ago

Then use GTF. I think NCBI GFF3 is somewhat problematic and is inconsistent with the corresponding GTF. Gencode/Ensembl GTF and GFF3 contain pretty much the same information.
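
As background on the format difference: GTF writes attributes as `key "value";` pairs, whereas GFF3 uses `key=value` pairs separated by semicolons. The sketch below parses a GFF3 attribute column and looks up a transcript identifier. It is a minimal example, not code from paftools.js: `parseGff3Attributes` and `transcriptIdOrParent` are hypothetical helpers, the attribute values are made up, and falling back to `Parent` when `transcript_id` is absent is an assumption made here for illustration only. Percent-decoding of values is omitted for brevity.

```javascript
// Hypothetical helpers for illustration; not functions from paftools.js.

// Parse the GFF3 attribute column (key=value pairs separated by ';').
function parseGff3Attributes(column9) {
    var attrs = Object.create(null);
    var parts = column9.split(";");
    for (var i = 0; i < parts.length; ++i) {
        var eq = parts[i].indexOf("=");
        if (eq <= 0) continue; // skip empty or malformed entries
        var key = parts[i].slice(0, eq).trim();
        attrs[key] = parts[i].slice(eq + 1).trim();
    }
    return attrs;
}

// Assumption: use Parent as a stand-in when transcript_id is missing.
function transcriptIdOrParent(attrs) {
    if (attrs.transcript_id !== undefined) return attrs.transcript_id;
    if (attrs.Parent !== undefined) return attrs.Parent;
    return null;
}

// Made-up attribute strings in GFF3 style.
var withId = parseGff3Attributes("ID=exon-1;Parent=rna-TX_1;transcript_id=TX_1");
var withoutId = parseGff3Attributes("ID=exon-2;Parent=rna-TX_2");
console.log(transcriptIdOrParent(withId), transcriptIdOrParent(withoutId));
```

Returning `null` instead of throwing leaves it to the caller to decide whether a record without any usable identifier should be skipped or treated as an error.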

lh3 commented 5 years ago

I am reopening this issue in case I come back to it and make further improvements for NCBI GFF3.

niehu2018 commented 5 years ago

> Please try the latest paftools. It should have resolved the issue.

I found that the human and mouse GTFs from ENSEMBL all have gene_id and gene_name, but some genes of other species (GFF from ENSEMBL) have a gene_id attribute and no gene_name attribute. How did you fix this problem? Do you just ignore the genes that have a "gene_id" attribute but no "gene_name" attribute in the BAM file, or do you use gene_id or something else instead of gene_name?

akshayMpatel commented 3 years ago

I am still getting the original "...ReferenceError: name is not defined..." error as above with minimap2 2.17-r941 (the latest version of paftools.js, I assume). I'm trying to use the --junc-bed option and only have the GTF.