vipints / GFFtools-GX

gfftools - Galaxy toolshed repository
BSD 3-Clause "New" or "Revised" License
15 stars 16 forks source link

TypeError: cannot perform reduce with flexible type #9

Open ramadatta opened 7 years ago

ramadatta commented 7 years ago

Hi Vipin,

Thank you very much for the GFFtools scripts. However, when I run the command,

python gff_to_bed.py ref.gff3 > out2.bed

while converting gff3 to bed 12 format, I receive an error below:

Traceback (most recent call last):
  File "gff_to_bed.py", line 115, in <module>
    __main__() 
  File "gff_to_bed.py", line 112, in __main__
    writeBED(Transcriptdb)
  File "gff_to_bed.py", line 86, in writeBED
    score = ent1['score'][0] if ent1['score'].any() else score
  File "/home/.linuxbrew/Cellar/python/2.7.13/lib/python2.7/site-packages/numpy/core/_methods.py", line 38, in _any
    return umr_any(a, axis, dtype, out, keepdims)
TypeError: cannot perform reduce with flexible type

Could I please know how to fix this error?

vipints commented 7 years ago

Hi @ramadatta, Can I have first few lines from your GFF3 input file?

ramadatta commented 7 years ago

Hi Vipin,

Please find it here:

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build ASM185804v2
#!genome-build-accession NCBI_Assembly:GCF_001858045.1
#!annotation-date 5 December 2016
#!annotation-source NCBI Oreochromis niloticus Annotation Release 103
##sequence-region NC_031965.1 1 38372991
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=8128
NC_031965.1     RefSeq  region  1       38372991        .       +       .       ID=id0;Dbxref=taxon:8128;Name=LG1;dev-stage=adult;gbkey=Src;genome=chromosome;isolate=F11D_XX;linkage-group=LG1;mol_type=genomic DNA;sex=female;tissue-type=blood
NC_031965.1     Gnomon  gene    11049   13118   .       -       .       ID=gene0;Dbxref=GeneID:102075536;Name=LOC102075536;gbkey=Gene;gene=LOC102075536;gene_biotype=lncRNA
NC_031965.1     Gnomon  lnc_RNA 11049   13118   .       -       .       ID=rna0;Parent=gene0;Dbxref=GeneID:102075536,Genbank:XR_266817.3;Name=XR_266817.3;gbkey=ncRNA;gene=LOC102075536;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 13 samples with support for all annotated introns;product=uncharacterized LOC102075536;transcript_id=XR_266817.3
NC_031965.1     Gnomon  exon    13071   13118   .       -       .       ID=id1;Parent=rna0;Dbxref=GeneID:102075536,Genbank:XR_266817.3;gbkey=ncRNA;gene=LOC102075536;product=uncharacterized LOC102075536;transcript_id=XR_266817.3
NC_031965.1     Gnomon  exon    12125   12153   .       -       .       ID=id2;Parent=rna0;Dbxref=GeneID:102075536,Genbank:XR_266817.3;gbkey=ncRNA;gene=LOC102075536;product=uncharacterized LOC102075536;transcript_id=XR_266817.3
NC_031965.1     Gnomon  exon    11049   11689   .       -       .       ID=id3;Parent=rna0;Dbxref=GeneID:102075536,Genbank:XR_266817.3;gbkey=ncRNA;gene=LOC102075536;product=uncharacterized LOC102075536;transcript_id=XR_266817.3
NC_031965.1     Gnomon  gene    31634   150960  .       -       .       ID=gene1;Dbxref=GeneID:100690862;Name=LOC100690862;gbkey=Gene;gene=LOC100690862;gene_biotype=protein_coding

Prakki Rama

vipints commented 7 years ago

@ramadatta the input gff3 file lines looks good to me. From the above error message, I am seeing little difference the current version of the code: https://github.com/vipints/GFFtools-GX/blob/master/gff_to_bed.py#L87. Can you test with this recent version of the script.

ramadatta commented 7 years ago

Hi Vipin,

Thank you for quick replies. I have got the gff3 file from NCBI. The following is the result when I used the latest version of your script:

$ python gff_to_bed_latest.py ref_ASM185804v2_top_level.gff3 >out3.bed
Traceback (most recent call last):
  File "gff_to_bed_latest.py", line 112, in <module>
    __main__() 
  File "gff_to_bed_latest.py", line 109, in __main__
    writeBED(Transcriptdb)
  File "gff_to_bed_latest.py", line 66, in writeBED
    score = ent1['transcript_score'][idx] if ent1['transcript_score'].any() else score ## getting the transcript score 
ValueError: no field of name transcript_score
vipints commented 7 years ago

@ramadatta can you provide me the ncbi link where you downloaded the gff3 file.

ramadatta commented 7 years ago

Hi Vipin,

Here is the annotation file:

ftp://ftp.ncbi.nlm.nih.gov/genomes/Oreochromis_niloticus/GFF/ref_ASM185804v2_top_level.gff3.gz

vipints commented 7 years ago

Hi @ramadatta, I have downloaded the file from the above link. There are many features available in this GFF file. cut -f 2,3 ref_ASM185804v2_top_level.gff3 | sort | uniq -c The file looks good with the features available. Now I found an issue with feature mappings in the file: NW_017615909.1 Gnomon gene 431704 433832 . + . ID=gene36112;Dbxref=GeneID:100700432;Name=LOC100700432;gbkey=Gene;gene=LOC100700432;gene_biotype=pseudogene;pseudo=true NW_017615909.1 Gnomon exon 431704 431773 . + . ID=id778571;Parent=gene36112;Dbxref=GeneID:100700432;gbkey=exon;gene=LOC100700432;model_evidence=Supporting evidence includes similarity to: 4 Proteins NW_017615909.1 Gnomon exon 433129 433832 . + . ID=id778572;Parent=gene36112;Dbxref=GeneID:100700432;gbkey=exon;gene=LOC100700432;model_evidence=Supporting evidence includes similarity to: 4 Proteins

The exon feature is directly mapped to gene, in this mapping there missing a additional layer of mapping exon -> transcript -> gene. I would say as per the GFF specification even if it is a single transcript gene, the mapping description should go as mentioned before. (gene -> transcript -> exon). May be it is good to contact the genome annotation center (ncbi) where you downloaded the file and inform them that the mapping in GFF is not valid.

vipints commented 7 years ago

in BED format we are representing the transcript feature in a single line.

vipints commented 7 years ago

I could do something that I can kick out these non properly matched annotation features. But I m not sure whether this will clean all single exon gene annotation from this file. I think then the resulting bed format will not contain any single exon gene. That will not be a good fix to this problem. Also you cannot define the missing transcript status whether a mRNA, lnRNA, or any other type of RNA feature, this has to be defined by the annotation center.

ramadatta commented 7 years ago

Hi @vipints

That appears true. Some transcripts are missing the status. Any other possible solution to think. May be I will try to approach the NCBI annotation team about this. Thanks so much for your help!