jtlovell / GENESPACE

Parsing error; n unique sequences = 7034, n matched to gff = 0 #143

Closed: kcl58759 closed 4 months ago

kcl58759 commented 7 months ago

Hi there! I am a new user, so I may be missing something obvious, but I am having trouble parsing a gff3 file. Here is my code:

parse_annotations(
  rawGenomeRepo = genomeRepo,
  genomeDirs = "E_festucae",
  genomeIDs = "E_festucae7",
  gffString = "gff3",
  faString = "faa",
  genespaceWd = wd,
  troubleShoot = TRUE,
  headerEntryIndex = 1,
  overwrite = FALSE,
  headerSep = " ",
  gffIdColumn = "ID")

Here is the parsing error:


### first 6 gff lines after parsing ... 
    seqid  source       type start   end score strand phase                id
   <fctr>  <fctr>     <fctr> <int> <int> <num> <char> <int>            <char>
1: Chr.01 Genbank       gene  6377  9750    NA      +    NA      C2857_000007
2: Chr.01 Genbank transcript  6377  9750    NA      +    NA nbis-transcript-1
3: Chr.01 Genbank       exon  6377  7226    NA      +    NA       nbis-exon-1
4: Chr.01 Genbank       exon  7600  7855    NA      +    NA       nbis-exon-2
5: Chr.01 Genbank       exon  7904  8144    NA      +    NA       nbis-exon-3
6: Chr.01 Genbank       exon  8207  8352    NA      +    NA       nbis-exon-4

### first 6 bed lines after full parsing (and potential chr re-name)
Empty data.table (0 rows and 4 cols): chr,start,end,id
E_festucae7: n unique sequences = 7034, n matched to gff = 0
                                                                      gffFileIn
                                                                         <char>
1: /Users/kendalllee/Documents/Epichloe_Annotations/E_festucae/E_festucae7.gff3
                                                                             faFileIn
                                                                               <char>
1: /Users/kendalllee/Documents/Epichloe_Annotations/E_festucae/E_festucae_protein.faa
                                                             bedFileOut
                                                                 <char>
1: /Users/kendalllee/Documents/Epichloe_Annotations/bed/E_festucae7.bed
                                                                 faFileOut
                                                                    <char>
1: /Users/kendalllee/Documents/Epichloe_Annotations/peptide/E_festucae7.fa

I am unsure why the gff3 parses at first but the resulting bed is empty. Any help is much appreciated!
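One possible way to narrow this down outside GENESPACE is to check whether the names in the protein fasta headers occur anywhere among the ID= attributes of the gff3. The sketch below is only an assumption about what a useful check looks like: count_id_matches is a hypothetical helper, not a GENESPACE function, and it does not reproduce how parse_annotations matches names internally. It takes the first space-separated word of each fasta header, which appears to be what headerEntryIndex = 1 and headerSep = " " ask for.

```shell
# Hypothetical helper (not part of GENESPACE): count how many FASTA header
# names also occur as an ID=... attribute on any feature line of a GFF3 file.
# Assumes a standard 9-column, tab-separated, non-empty GFF3.
# Usage: count_id_matches annotation.gff3 proteins.faa
count_id_matches() {
  awk -F'\t' '
    # First file (GFF3): remember every ID=... value found in column 9.
    NR == FNR {
      if ($0 ~ /^#/ || NF < 9) next
      n = split($9, attrs, ";")
      for (i = 1; i <= n; i++)
        if (attrs[i] ~ /^ID=/) ids[substr(attrs[i], 4)] = 1
      next
    }
    # Second file (FASTA): use the first space-separated word of each header.
    /^>/ {
      split(substr($0, 2), words, " ")
      total++
      if (words[1] in ids) matched++
    }
    END { printf "n sequences = %d, n matched to gff = %d\n", total, matched }
  ' "$1" "$2"
}
```

If this also reports 0 matches, the fasta headers probably use names that never appear as an ID in the gff3, and printing a few lines of each file side by side (for example with grep ">" proteins.faa | head) should show how they differ.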

tallnuttrbgv commented 7 months ago

Hi, I had a lot of problems parsing too and ended up making my own bed and peptide files. Just put them in the same directory where you run GENESPACE and make sure the gene names are identical in the bed and fasta files (the fasta files must end in .fa). You can make a bed with cut -f1,3,4,8 your.gff > your.bed. You might have to change the cut command depending on the gtf/gff/gff3 structure; in a standard 9-column GFF3, the sequence name, feature type, start, end and attributes (which hold the ID) are columns 1, 3, 4, 5 and 9. Also, I could not parse the bed if it had 'cds', 'exon', etc., so I had to run grep -P "\tgene\t" file.bed > genes.bed to keep only the genes. I think this is sufficient for large-scale synteny investigations.

Hope that helps.
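As a rough sketch of the manual route described above, assuming a standard 9-column GFF3 in which each gene line carries an ID=... attribute, the cut and grep steps can be combined into one awk call. gff_genes_to_bed is a hypothetical helper name, not something GENESPACE provides, and the four output columns (chr, start, end, id) follow the bed header shown in the troubleshooting output earlier in this thread.

```shell
# Hypothetical helper: print chr, start, end, id (tab-separated) for every
# "gene" feature in a GFF3 file. Assumes 9 tab-separated columns with an
# ID=... entry in column 9. Coordinates are copied as-is from the GFF3
# (1-based), not shifted to the 0-based start of strict BED.
# Usage: gff_genes_to_bed annotation.gff3 > genes.bed
gff_genes_to_bed() {
  awk -F'\t' -v OFS='\t' '
    $0 !~ /^#/ && NF >= 9 && $3 == "gene" {
      id = ""
      n = split($9, attrs, ";")
      for (i = 1; i <= n; i++)
        if (attrs[i] ~ /^ID=/) { id = substr(attrs[i], 4); break }
      # Skip gene lines that have no ID attribute at all.
      if (id != "") print $1, $4, $5, id
    }
  ' "$1"
}
```

The ids written this way are gene IDs (C2857_000007 in the output above), so they only help if the peptide fasta headers use the same names. If the fasta is named by transcript or protein ID instead, either the headers or the bed ids would need renaming so that they agree.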

kcl58759 commented 7 months ago

Thanks so much!