Closed felipe797 closed 3 years ago
Hi @felipe797. Thanks for reporting. So you are saying that if we just add -sorted
to the command that it will run? I can't remember off the top of my head if the inputs are sorted. One other option would be to not use bedtools to do this.
If you wanted to skip that step, you could pass --repeat_filter blast
which will then skip this step of bedtools overlap filtering of repetitive gene models.
No, if you use -sorted
when the files are not actually sorted it complains:
Error: Sorted input specified, but the file evm.round1.gff3~ has the following out of order record scaffold_26 EVM exon 3755 4003 . - . ID=evm.model.scaffold_26.3.exon2;Parent=evm.model.scaffold_26.3
I sorted the files manually using unix sort
, but there's also sortBed
option in bedtools.
Right. So let me look at it a little closer, but one issue might be the GFF3 parser after this -- there isn't a way to to maintain the GFF3 structure (gene --> mRNA --> exon --> CDS) with using either sort
or sortBed
, which then might cause some downstream problems. I could just not use bedtools
to filter but rather use python interlap
.
That is true.
I initially thought the problem was just the bed file, which would have been fine with -sort
I think interlap
would do the job but probably be slower than bedtools for larger datasets ?
Yeah it might be slower with interlap. I just made it generated sorted input for the repeat filtering step as I'm not using those outputs directly, testing locally and then I can push those changes. I can also write a sorting "properly" with python and keep the desired GFF3 structure. So do you know if then it also dies on the tRNA step? Or likely you haven't gotten there yet I presume because it died on the repeat filtering.
Yeah, I didn't get there. I'm running it now with blast in the repeat filtering as you suggested. I'll start an additional run it with the sorted files I generated just to see if there's any weird behaviour. Keep you updated.
Okay, here is a quick python function I'll use to sort properly, I don't think this will break bedtools sorting as it is just using as a tiebreaker the features and sorting those in the order I want them:
def sortGFFproper(input, output):
# function to sort GFF3 file but maintain gene, mrna, exon, cds order
data = []
features = set()
comments = []
with open(input, 'r') as infile:
for line in infile:
if line.startswith('\n'):
continue
if line.startswith('#'):
comments.append(line)
continue
line = line.rstrip()
cols = line.split('\t')
data.append(cols)
features.add(cols[2])
# build sort order dictionary for features
order_map = {'gene': 0, 'mRNA': 1, 'transcript': 2, 'tRNA': 3, 'ncRNA': 4,
'rRNA': 5, 'pseudogene': 6, 'five_prime_utr': 7,
'five_prime_UTR': 8, 'exon': 9, 'CDS': 10,
'three_prime_utr': 11, 'three_prime_UTR': 12}
idx = len(order_map)
for x in features:
if x not in order_map:
order_map[x] = idx
idx += 1
# we can now sort
sort_data = sorted(data, key=lambda x: (x[0], int(x[3]), order_map[x[2]]))
# now we can write back out to file
with open(output, 'w') as outfile:
for y in comments:
outfile.write(y)
for x in sort_data:
outfile.write('{}\n'.format('\t'.join(x)))
This looks great. Thank you so much for the quick reply and solution.
Not sure if gffutils has sort feature like this too?
@hyphaltip I'm still testing here, it didn't like my first attempt. But if works I think I'll use this sorting on all incoming GFFs as well as sometimes that can be one of the issues. I'll tag this issue when I push the working code.
We were going to try a larger memory allocation for the job but it seems like if we can solve this with sorted it would be more efficient?
Okay, this passed the test data. Added -sorted
to both bedtools calls and sorting with the function above. Let me know if that reduces memory.
I updated on the system @felipe797 so you can re-try and see if it behaves better?
@hyphaltip still giving the same error for the two files. I did a little investigation on the evm.round1.gff3
and it looks like it's still not sorted the way bedtools wants it.
tail -2
on the file I sorted manually before and bedtools intersect
behaved on:
scaffold_99 EVM exon 115720 116052 . - . ID=evm.model.scaffold_99.34.exon1;Parent=evm.model.scaffold_99.34 scaffold_99 EVM CDS 115720 116052 . - 0 ID=cds.evm.model.scaffold_99.34;Parent=evm.model.scaffold_99.34
while tail -2
on the file generated on my current attempt:
scaffold_558 EVM exon 6650 11401 . + . ID=evm.model.scaffold_558.1.exon1;Parent=evm.model.scaffold_558.1 scaffold_558 EVM CDS 6650 11401 . + 0 ID=cds.evm.model.scaffold_558.1;Parent=evm.model.scaffold_558.1
Which is essentially the same as the file I was having trouble with before. Maybe there's no way we can get the .gff3 to be sorted exactly the way bedtools expect without disrupting its structure.
Hmm, I used python sorted
which should mimic unix sort
-- I almost always prefer python natsort
but in this case new that would change the scaffold order to what a human would expect (ie 99 before 558). I must not have had enough scaffolds in the test dataset to see this.
We can maybe be more explicit with the bedtools interest -sorted -g genome.file
option. Or I can just unix sort them for this particular task.... not really sure why they would be different.
In this case I think what's causing trouble is the start position of the contigs. It's probably expecting upstream contigs to come before in the file (ie scaffold 99's CDS starts on 115720 and scaffold 558's on 6650, so the latter should come first).
I don't think so, its sort -k1,1 -k4,4n
first, so that's the scaffolds first and then sorted by start position (which is what I did in python sort, but apparently it treats the scaffold names differently than unix/bedtools sort). It wants them in a specific order. One sec I'm testing just using bedtools sort
and then converting back to the way in which I think it should be after. Should have a fix in a few minutes.
@hyphaltip try that latest, tests pass locally. Now using bedtools sort
to sort for the intersection. For the tRNA step it does the same thing but then "properly" sorts those output after so we don't have potential issue with loading into the dictionary structure.
@nextgenusfs I gave it a another try with the updated version today and it worked perfectly. Thank you!
Hi Jon! First of all, thanks for the pipeline, it's been really useful to me!
So, I ran into this problem in predict, where it's quitting after a bedtools intersect error. From logfile:
[09:39 AM]: Generating protein fasta files from 22,573 EVM models\n [09:39 AM]: now filtering out bad gene models (< 50 aa in length, transposable elements, etc).\n [09:39 AM]: CMD ERROR: bedtools intersect -f 0.9 -a\n /bigdata/stajichlab/fmoreira/Metarhizium_acridum/annotate/New_polish/polished/predict_misc/evm.round1.gff3 -b /bigdata/stajichlab/fmoreira/Metarhizium_acridum/annotate/New_polish/polished/predict_misc/repeatmasker.bed [09:39 AM]: (None, b'ERROR: Received illegal bin number -1 from getBin call.\nMaximum values is: 2396745\nThis typically means that your coordinates are\nnegative or too large to represent in the data\nstructure bedtools uses to find intersections.ERROR: Unable to add record to tree.\n')
I'm running the new versions of both funannotate (v1.8.2) and bedtools (v2.29.2). After some troubleshooting, I found out bedtools intersect has problems with large scaffolds and the recommendation is, according to their readthedocs:
I proceeded to sort both my
repeatmasker.bed
andevm.round1.gff3
and run bedtools intersect manually on the sorted new files, using the same command funannotate predict did and same error popped up :But it ran smoothly in the same command but with the flag
-sorted
with the sorted files:bedtools intersect -f 0.9 -a evm.round1.gff3 -b repeatmasker.sorted.bed -sorted
Not sure but maybe a suggestion would be conditional statement to check if infiles are sorted by chromosome and start position and if so run the command with -sorted?
in
funannotate/library.py
line 5656 :def validate_tRNA(input, genes, gaps, output): cmd = ['bedtools', 'intersect', '-v', '-a', input, '-b', genes] if gaps: cmd.append(gaps) runSubprocess2(cmd, '.', log, output)
Let me know. Meanwhile, can you think of an alternative way to get around this without editing source? Thank you so much in advance!