openvax / gtfparse

Parsing tools for GTF (gene transfer format) files
Apache License 2.0

gtfparse.parsing_error.ParsingError: Integer column has NA values in column 3 #19

mictadlo opened this issue 4 years ago (status: Open)

Hi, I used the following GTF file from Scallop:

chr01_pilon_pilon       scallop transcript      168145  169166  1000    .       .       gene_id "gene.1.0"; transcript_id "gene.1.0.0"; RPKM "9.8016"; cov "27.2554";
chr01_pilon_pilon       scallop exon    168145  169166  1000    .       .       gene_id "gene.1.0"; transcript_id "gene.1.0.0"; exon "1";
chr01_pilon_pilon       scallop transcript      410672  444190  1000    +       .       gene_id "gene.3.0"; transcript_id "gene.3.0.1"; RPKM "0.5936"; cov "1.6506";
chr01_pilon_pilon       scallop exon    410672  411205  1000    +       .       gene_id "gene.3.0"; transcript_id "gene.3.0.1"; exon "1";
chr01_pilon_pilon       scallop exon    411326  411482  1000    +       .       gene_id "gene.3.0"; transcript_id "gene.3.0.1"; exon "2";
chr01_pilon_pilon       scallop exon    444085  444190  1000    +       .       gene_id "gene.3.0"; transcript_id "gene.3.0.1"; exon "3";
chr01_pilon_pilon       scallop transcript      678431  681781  1000    +       .       gene_id "gene.6.0"; transcript_id "gene.6.0.1"; RPKM "0.4733"; cov "1.3162";
chr01_pilon_pilon       scallop exon    678431  678822  1000    +       .       gene_id "gene.6.0"; transcript_id "gene.6.0.1"; exon "1";
chr01_pilon_pilon       scallop exon    678917  679068  1000    +       .       gene_id "gene.6.0"; transcript_id "gene.6.0.1"; exon "2";
chr01_pilon_pilon       scallop exon    680616  681781  1000    +       .       gene_id "gene.6.0"; transcript_id "gene.6.0.1"; exon "3";
chr01_pilon_pilon       scallop transcript      678431  685851  1000    +       .       gene_id "gene.6.0"; transcript_id "gene.6.0.2"; RPKM "7.9200"; cov "22.0230";
chr01_pilon_pilon       scallop exon    678431  678822  1000    +       .       gene_id "gene.6.0"; transcript_id "gene.6.0.2"; exon "1";
chr01_pilon_pilon       scallop exon    678917  679068  1000    +       .       gene_id "gene.6.0"; transcript_id "gene.6.0.2"; exon "2";
chr01_pilon_pilon       scallop exon    680616  681627  1000    +       .       gene_id "gene.6.0"; transcript_id "gene.6.0.2"; exon "3";
chr01_pilon_pilon       scallop exon    681874  681965  1000    +       .       gene_id "gene.6.0"; transcript_id "gene.6.0.2"; exon "4";
chr01_pilon_pilon       scallop exon    683200  683274  1000    +       .       gene_id "gene.6.0"; transcript_id "gene.6.0.2"; exon "5";
chr01_pilon_pilon       scallop exon    683744  683805  1000    +       .       gene_id "gene.6.0"; transcript_id "gene.6.0.2"; exon "6";
chr01_pilon_pilon       scallop exon    684640  684709  1000    +       .       gene_id "gene.6.0"; transcript_id "gene.6.0.2"; exon "7";
chr01_pilon_pilon       scallop exon    684842  685851  1000    +       .       gene_id "gene.6.0"; transcript_id "gene.6.0.2"; exon "8";
chr01_pilon_pilon       scallop transcript      681834  685851  1000    +       .       gene_id "gene.6.0"; transcript_id "gene.6.0.3"; RPKM "1.1177"; cov "3.1080";
chr01_pilon_pilon       scallop exon    681834  681965  1000    +       .       gene_id "gene.6.0"; transcript_id "gene.6.0.3"; exon "1";
chr01_pilon_pilon       scallop exon    683200  683274  1000    +       .       gene_id "gene.6.0"; transcript_id "gene.6.0.3"; exon "2";
chr01_pilon_pilon       scallop exon    683744  683805  1000    +       .       gene_id "gene.6.0"; transcript_id "gene.6.0.3"; exon "3";
chr01_pilon_pilon       scallop exon    684640  684709  1000    +       .       gene_id "gene.6.0"; transcript_id "gene.6.0.3"; exon "4";
chr01_pilon_pilon       scallop exon    684842  685851  1000    +       .       gene_id "gene.6.0"; transcript_id "gene.6.0.3"; exon "5";
chr01_pilon_pilon       scallop transcript      683032  685851  1000    +       .       gene_id "gene.6.0"; transcript_id "gene.6.0.0"; RPKM "0.6001"; cov "1.6686";
chr01_pilon_pilon       scallop exon    683032  683274  1000    +       .       gene_id "gene.6.0"; transcript_id "gene.6.0.0"; exon "1";
chr01_pilon_pilon       scallop exon    683744  683805  1000    +       .       gene_id "gene.6.0"; transcript_id "gene.6.0.0"; exon "2";
chr01_pilon_pilon       scallop exon    684640  684709  1000    +       .       gene_id "gene.6.0"; transcript_id "gene.6.0.0"; exon "3";
chr01_pilon_pilon       scallop exon    684842  685851  1000    +       .       gene_id "gene.6.0"; transcript_id "gene.6.0.0"; exon "4";

However, the code below caused the following errors:

Traceback (most recent call last):
  File "/anaconda/envs/bioinf-scripts/lib/python3.7/site-packages/gtfparse/read_gtf.py", line 103, in parse_gtf
    for df in chunk_iterator:
  File "/anaconda/envs/bioinf-scripts/lib/python3.7/site-packages/pandas/io/parsers.py", line 1128, in __next__
    return self.get_chunk()
  File "/anaconda/envs/bioinf-scripts/lib/python3.7/site-packages/pandas/io/parsers.py", line 1188, in get_chunk
    return self.read(nrows=size)
  File "/anaconda/envs/bioinf-scripts/lib/python3.7/site-packages/pandas/io/parsers.py", line 1154, in read
    ret = self._engine.read(nrows)
  File "/anaconda/envs/bioinf-scripts/lib/python3.7/site-packages/pandas/io/parsers.py", line 2059, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 881, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 908, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 973, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1105, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1136, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas/_libs/parsers.pyx", line 1232, in pandas._libs.parsers.TextReader._convert_with_dtype
ValueError: Integer column has NA values in column 3

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/lorencm/projects/bioinf-scripts/reduceIsoforms.py", line 24, in <module>
    df = read_gtf("data/hybrid.gtf")
  File "/anaconda/envs/bioinf-scripts/lib/python3.7/site-packages/gtfparse/read_gtf.py", line 211, in read_gtf
    restrict_attribute_columns=usecols)
  File "/anaconda/envs/bioinf-scripts/lib/python3.7/site-packages/gtfparse/read_gtf.py", line 154, in parse_gtf_and_expand_attributes
    features=features)
  File "/anaconda/envs/bioinf-scripts/lib/python3.7/site-packages/gtfparse/read_gtf.py", line 122, in parse_gtf
    raise ParsingError(str(e))
gtfparse.parsing_error.ParsingError: Integer column has NA values in column 3

This is the code:

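from gtfparse import read_gtf
# This is the call that raises the ParsingError (line 24 of reduceIsoforms.py).
df = read_gtf("data/hybrid.gtf")
# Keep only the rows whose feature type is "gene".
df_genes = df[df["feature"] == "gene"]
print(df_genes)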
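
One way to narrow this down is to scan the file for rows that do not match the usual GTF layout before handing it to the parser. The sketch below assumes the file is meant to have nine tab-separated fields per row, and that "column 3" in the pandas message is a zero-based index (which would make it the start coordinate); neither assumption is confirmed by the traceback. `find_bad_gtf_lines` is a hypothetical helper written for illustration and is not part of gtfparse.

```python
import io


def find_bad_gtf_lines(handle):
    """Yield (line_number, reason) for rows that look malformed.

    Hypothetical helper: a row is flagged if it does not split into
    nine tab-separated fields, or if its start/end fields are not
    plain non-negative integers.
    """
    for lineno, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        # Skip blank lines and header/comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            yield lineno, f"expected 9 tab-separated fields, found {len(fields)}"
            continue
        # Zero-based fields 3 and 4 hold the start and end coordinates.
        for name, value in (("start", fields[3]), ("end", fields[4])):
            if not value.strip().isdigit():
                yield lineno, f"{name} is not an integer: {value!r}"


# Small in-memory example: the first row is tab-delimited,
# the second uses single spaces instead of tabs.
sample = io.StringIO(
    'chr01\tscallop\texon\t168145\t169166\t1000\t.\t.\tgene_id "gene.1.0";\n'
    'chr01 scallop exon 168145 169166 1000 . . gene_id "gene.1.0";\n'
)
problems = list(find_bad_gtf_lines(sample))
for lineno, reason in problems:
    print(f"line {lineno}: {reason}")
```

To run it against the real file, pass an open file handle instead of the in-memory sample, for example `with open("data/hybrid.gtf") as fh: print(list(find_bad_gtf_lines(fh)))`. An empty result would suggest the problem lies elsewhere than the delimiter or the coordinate columns.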

What did I miss?

Thank you in advance,

Michal