Funannotate exits in predict: bedtools intersect error

felipe797 commented 3 years ago

Hi Jon! First of all, thanks for the pipeline, it's been really useful to me!

So, I ran into this problem in predict, where it's quitting after a bedtools intersect error. From logfile:

[09:39 AM]: Generating protein fasta files from 22,573 EVM models\n [09:39 AM]: now filtering out bad gene models (< 50 aa in length, transposable elements, etc).\n [09:39 AM]: CMD ERROR: bedtools intersect -f 0.9 -a\n /bigdata/stajichlab/fmoreira/Metarhizium_acridum/annotate/New_polish/polished/predict_misc/evm.round1.gff3 -b /bigdata/stajichlab/fmoreira/Metarhizium_acridum/annotate/New_polish/polished/predict_misc/repeatmasker.bed [09:39 AM]: (None, b'ERROR: Received illegal bin number -1 from getBin call.\nMaximum values is: 2396745\nThis typically means that your coordinates are\nnegative or too large to represent in the data\nstructure bedtools uses to find intersections.ERROR: Unable to add record to tree.\n')

I'm running the new versions of both funannotate (v1.8.2) and bedtools (v2.29.2). After some troubleshooting, I found out bedtools intersect has problems with large scaffolds and the recommendation is, according to their readthedocs:

Note

If you are trying to intersect very large files and are having trouble with excessive memory usage, please presort your data by chromosome and then by start position (e.g., sort -k1,1 -k2,2n in.bed > in.sorted.bed for BED files) and then use the -sorted option. This invokes a memory-efficient algorithm designed for large files. This algorithm has been substantially improved in recent (>=2.18.0) releases.

I proceeded to sort both my repeatmasker.bed and evm.round1.gff3 and run bedtools intersect manually on the sorted new files, using the same command funannotate predict did and same error popped up :

bedtools intersect -f 0.9 -a evm.round1.gff3 -b repeatmasker.sorted.bed        

ERROR: Received illegal bin number -1 from getBin call.
Maximum values is: 2396745
This typically means that your coordinates are
negative or too large to represent in the data
structure bedtools uses to find intersections.ERROR: Unable to add record to tree.

But it ran smoothly in the same command but with the flag -sorted with the sorted files:

bedtools intersect -f 0.9 -a evm.round1.gff3 -b repeatmasker.sorted.bed -sorted

Not sure but maybe a suggestion would be conditional statement to check if infiles are sorted by chromosome and start position and if so run the command with -sorted?

in funannotate/library.py line 5656 :

def validate_tRNA(input, genes, gaps, output): cmd = ['bedtools', 'intersect', '-v', '-a', input, '-b', genes] if gaps: cmd.append(gaps) runSubprocess2(cmd, '.', log, output)

Let me know. Meanwhile, can you think of an alternative way to get around this without editing source? Thank you so much in advance!

nextgenusfs commented 3 years ago

Hi @felipe797. Thanks for reporting. So you are saying that if we just add -sorted to the command that it will run? I can't remember off the top of my head if the inputs are sorted. One other option would be to not use bedtools to do this.

If you wanted to skip that step, you could pass --repeat_filter blast which will then skip this step of bedtools overlap filtering of repetitive gene models.

felipe797 commented 3 years ago

No, if you use -sorted when the files are not actually sorted it complains:

Error: Sorted input specified, but the file evm.round1.gff3~ has the following out of order record scaffold_26 EVM exon 3755 4003 . - . ID=evm.model.scaffold_26.3.exon2;Parent=evm.model.scaffold_26.3

I sorted the files manually using unix sort, but there's also sortBed option in bedtools.

nextgenusfs commented 3 years ago

Right. So let me look at it a little closer, but one issue might be the GFF3 parser after this -- there isn't a way to to maintain the GFF3 structure (gene --> mRNA --> exon --> CDS) with using either sort or sortBed, which then might cause some downstream problems. I could just not use bedtools to filter but rather use python interlap.

felipe797 commented 3 years ago

That is true. I initially thought the problem was just the bed file, which would have been fine with -sort I think interlap would do the job but probably be slower than bedtools for larger datasets ?

nextgenusfs commented 3 years ago

Yeah it might be slower with interlap. I just made it generated sorted input for the repeat filtering step as I'm not using those outputs directly, testing locally and then I can push those changes. I can also write a sorting "properly" with python and keep the desired GFF3 structure. So do you know if then it also dies on the tRNA step? Or likely you haven't gotten there yet I presume because it died on the repeat filtering.

felipe797 commented 3 years ago

Yeah, I didn't get there. I'm running it now with blast in the repeat filtering as you suggested. I'll start an additional run it with the sorted files I generated just to see if there's any weird behaviour. Keep you updated.

nextgenusfs commented 3 years ago

Okay, here is a quick python function I'll use to sort properly, I don't think this will break bedtools sorting as it is just using as a tiebreaker the features and sorting those in the order I want them:

def sortGFFproper(input, output):
    # function to sort GFF3 file but maintain gene, mrna, exon, cds order
    data = []
    features = set()
    comments = []
    with open(input, 'r') as infile:
        for line in infile:
            if line.startswith('\n'):
                continue
            if line.startswith('#'):
                comments.append(line)
                continue
            line = line.rstrip()
            cols = line.split('\t')
            data.append(cols)
            features.add(cols[2])
    # build sort order dictionary for features
    order_map = {'gene': 0, 'mRNA': 1, 'transcript': 2, 'tRNA': 3, 'ncRNA': 4,
                 'rRNA': 5, 'pseudogene': 6, 'five_prime_utr': 7,
                 'five_prime_UTR': 8, 'exon': 9, 'CDS': 10,
                 'three_prime_utr': 11, 'three_prime_UTR': 12}
    idx = len(order_map)
    for x in features:
        if x not in order_map:
            order_map[x] = idx
            idx += 1
    # we can now sort
    sort_data = sorted(data, key=lambda x: (x[0], int(x[3]), order_map[x[2]]))
    # now we can write back out to file
    with open(output, 'w') as outfile:
        for y in comments:
            outfile.write(y)
        for x in sort_data:
            outfile.write('{}\n'.format('\t'.join(x)))

felipe797 commented 3 years ago

This looks great. Thank you so much for the quick reply and solution.

hyphaltip commented 3 years ago

Not sure if gffutils has sort feature like this too?

nextgenusfs commented 3 years ago

@hyphaltip I'm still testing here, it didn't like my first attempt. But if works I think I'll use this sorting on all incoming GFFs as well as sometimes that can be one of the issues. I'll tag this issue when I push the working code.

hyphaltip commented 3 years ago

We were going to try a larger memory allocation for the job but it seems like if we can solve this with sorted it would be more efficient?

nextgenusfs commented 3 years ago

Okay, this passed the test data. Added -sorted to both bedtools calls and sorting with the function above. Let me know if that reduces memory.

hyphaltip commented 3 years ago

I updated on the system @felipe797 so you can re-try and see if it behaves better?

felipe797 commented 3 years ago

@hyphaltip still giving the same error for the two files. I did a little investigation on the evm.round1.gff3 and it looks like it's still not sorted the way bedtools wants it.

tail -2 on the file I sorted manually before and bedtools intersect behaved on:

scaffold_99 EVM exon 115720 116052 . - . ID=evm.model.scaffold_99.34.exon1;Parent=evm.model.scaffold_99.34 scaffold_99 EVM CDS 115720 116052 . - 0 ID=cds.evm.model.scaffold_99.34;Parent=evm.model.scaffold_99.34

while tail -2 on the file generated on my current attempt:

scaffold_558 EVM exon 6650 11401 . + . ID=evm.model.scaffold_558.1.exon1;Parent=evm.model.scaffold_558.1 scaffold_558 EVM CDS 6650 11401 . + 0 ID=cds.evm.model.scaffold_558.1;Parent=evm.model.scaffold_558.1

Which is essentially the same as the file I was having trouble with before. Maybe there's no way we can get the .gff3 to be sorted exactly the way bedtools expect without disrupting its structure.

nextgenusfs commented 3 years ago

Hmm, I used python sorted which should mimic unix sort -- I almost always prefer python natsort but in this case new that would change the scaffold order to what a human would expect (ie 99 before 558). I must not have had enough scaffolds in the test dataset to see this.

We can maybe be more explicit with the bedtools interest -sorted -g genome.file option. Or I can just unix sort them for this particular task.... not really sure why they would be different.

felipe797 commented 3 years ago

In this case I think what's causing trouble is the start position of the contigs. It's probably expecting upstream contigs to come before in the file (ie scaffold 99's CDS starts on 115720 and scaffold 558's on 6650, so the latter should come first).

nextgenusfs commented 3 years ago

I don't think so, its sort -k1,1 -k4,4n first, so that's the scaffolds first and then sorted by start position (which is what I did in python sort, but apparently it treats the scaffold names differently than unix/bedtools sort). It wants them in a specific order. One sec I'm testing just using bedtools sort and then converting back to the way in which I think it should be after. Should have a fix in a few minutes.

nextgenusfs commented 3 years ago

@hyphaltip try that latest, tests pass locally. Now using bedtools sort to sort for the intersection. For the tRNA step it does the same thing but then "properly" sorts those output after so we don't have potential issue with loading into the dictionary structure.

felipe797 commented 3 years ago

@nextgenusfs I gave it a another try with the updated version today and it worked perfectly. Thank you!

nextgenusfs / funannotate

Funannotate exits in predict: bedtools intersect error #522