Error when using the --full-search flag

FlaviaTG commented 9 months ago

Hi,

I have been using dvorfs successfully with exciting results. But when I wanted to try the --full-search, I got a problem with GeneWise with the following error:

Converted HMM databases already present, skipping step.
Windowing genome...
GeneWise outfile already present, skipping step.
Processing GeneWise hits...
Traceback (most recent call last):
  File "/n/home06/ftermignonigarcia/.conda/envs/MAMBA/envs/snakemake2/envs/DVORFS/bin/dvorfs", line 33, in <module>
    sys.exit(load_entry_point('dvorfs==1.0.1', 'console_scripts', 'dvorfs')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/n/home06/ftermignonigarcia/.conda/envs/MAMBA/envs/snakemake2/envs/DVORFS/lib/python3.12/site-packages/dvorfs/dvorfs.py", line 434, in main
    run_dvorfs(args)
  File "/n/home06/ftermignonigarcia/.conda/envs/MAMBA/envs/snakemake2/envs/DVORFS/lib/python3.12/site-packages/dvorfs/dvorfs.py", line 310, in run_dvorfs
    gw_tsv, gw_alidir = process_genewise_output(windowed_fasta, gw_out, windowed=True,
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/n/home06/ftermignonigarcia/.conda/envs/MAMBA/envs/snakemake2/envs/DVORFS/lib/python3.12/site-packages/dvorfs/dvorfs.py", line 208, in process_genewise_output
    df, alis = process_genewise(gw_out, sefasta,
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/n/home06/ftermignonigarcia/.conda/envs/MAMBA/envs/snakemake2/envs/DVORFS/lib/python3.12/site-packages/dvorfs/process_genewise.py", line 256, in process_genewise
    .sort_values(['query','bits'],ascending=[True,False])
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/n/home06/ftermignonigarcia/.conda/envs/MAMBA/envs/snakemake2/envs/DVORFS/lib/python3.12/site-packages/pandas/core/frame.py", line 6930, in sort_values
    keys = [self._get_label_or_level_values(x, axis=axis) for x in by]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/n/home06/ftermignonigarcia/.conda/envs/MAMBA/envs/snakemake2/envs/DVORFS/lib/python3.12/site-packages/pandas/core/generic.py", line 1844, in _get_label_or_level_values
    raise KeyError(key)
KeyError: 'query'

I would like to use the --full-search version. I have tried to use smaller chunks of the genome, databases, and genomes. It seems it is getting more hits than expected than with the presearch but cannot process them?

ilevantis commented 9 months ago

The error is occuring when parsing the genewise output file so there's something unexpected in there that the parsing script has not accounted for.

Given the output you provided it seems you are rerunning an analysis and are using the --keep-workdir option. Have you tried deleting the work directory and running the full search mode from scratch?

If that gave the same error, what does the gw.out file look like inside the work directory - that should be the raw output of genewise?

FlaviaTG commented 9 months ago

I have tried running it in a new directory. The same error persists. The gw.out file is empty without alignments. Could it be the error indicating "no hits"? What difference should I expect by using the presearch VS full-search? If I try a region (with a bed file) where the presearch found a hit but with the full-search, would you expect the same hit?

ilevantis commented 9 months ago

Yes I would expect the same hits. When presearch is enabled, a bed file is generated based on the presearch hits, when a bed file is provided that is used in the same way as the one generated by the presearch, so feel free to make a custom bedfile with smaller or more specific regions to search in.

If presearch is finding hits but the gw.out file is empty when using --full-search then genewise may be having issues and exiting without outputting anything and also not raising an error (which should be caught by python).

It may be that the memory size allotted for genewise is too small or the genome window size is too large for the density of hits and this could be causing the silent error in genewise.

I unfortunately didn't make the parameters for window size and GeneWise memory size configurable via command line, so you would have to go into your local copy of the source code and edit the code to test this out though.

If you navigate to /n/home06/ftermignonigarcia/.conda/envs/MAMBA/envs/snakemake2/envs/DVORFS/lib/python3.12/site-packages/dvorfs/dvorfs.py edit the following lines:

Line 162: ['bedtools', 'makewindows', '-g', dbfafai, '-w', '20000', '-s', '15000'], to ['bedtools', 'makewindows', '-g', dbfafai, '-w', '10000', '-s', '5000'],

Line 294: run_genewise(windowed_fasta, hmm2db, gw_out, threads=args.procs, mem=4e6) to run_genewise(windowed_fasta, hmm2db, gw_out, threads=args.procs, mem=8e6)

This should give genewise ~ 8GB of ram to work with and pass it genome sequence in 10KB chunks instead of 20KB chunks.

Failing any of those potential fixes, is there a chance you would be able to send me the data you are trying to run your analysis on?

FlaviaTG commented 9 months ago

Fantastic! Thanks so much for the guidance. I have done some memory and window size modifications to the script, and now it is running to the end with the whole genome and my costume databases. I would try back checkups with the bed files to corroborate outputs between the presearch and the full-search. I would be more than happy to share some of the data I am using. I am using public genomes from NCBI, PFAM, and costume databases. How would you like me to share that with you? My questions arose because I was expecting some putative EVEs sequences in a specific genome that, with the presearch run, did not give hits, but now, with the full-search, it is giving some hits that seem short (compared to the pre-search on other genomes of sister species) and makes me feel skeptical. Maybe now it is just a matter of finding the correct window size? Should I adjust the window size according to the size of the protein alignment in the PFAM databases? Does that make sense?

ilevantis commented 9 months ago

The longer the window size the better, however GenWise seems to have issues when running with large windows even if lots of RAM is provided.

Roughly how long are the AA sequences you are expecting hits for?

The only issue I can forsee due to window size would be if the EVEs you are searching for have a relatively intact sequence spanning more than 5000bp (the window overlap length) - and even then, the processing step should stitch the multiple hits across the different windows together where sensible.

Although this method is pretty good at finding EVEs, it's not perfect, so it may just be failing to catch these specific EVEs due to peculiarities of their specific sequence. If you have the sequences recovered from a related species I would use blastn with those related species EVE sequences as the query against this genome to see what you can find as that will be the most sensitive way to find closely related sequences.

ilevantis / dvorfs

Error when using the --full-search flag #3