ropensci / phylotaR

An automated pipeline for retrieving orthologous DNA sequences from GenBank in R
https://docs.ropensci.org/phylotaR
Other
23 stars 8 forks source link

problem with the restez/phylota integration #49

Closed DomBennett closed 4 years ago

DomBennett commented 4 years ago

Issue(s) resolved via email ....


On Tue, Jun 9, 2020 at 8:20 PM Alexandre Pedro <apselvatti@gmail.com> wrote:

Dear Dominc

I'm a Post-Doc at the Federal University of Rio de Janeiro working with birds
and vertebrate macroevolution and biogeography in general. I really appreciate
your efforts to create and maintain the new PhylotaR, and I'm absolutely sure
this will be one the most useful phylogenetic tools to work with GenBank data
from now on.

I'm having a problem with the restez/phylota integration, not sure why. I really
tried to solve everything without having to send you this, but decided to do so
after the program told me to contact the package maintainer:

Error in stages_run(wd = wd, frm = 1, to = nstages, stgs_msg = stgs_msg) : 
Unexpected Error in if (ps[["multiple_ids"]]) { : argument is of length zero

Occurred [2020-06-09 19:57:45]
Contact package maintainer for help.

I'm trying to do a PhylotaR search on the Araceae family (txid 4454), which is
taking too long to run through the PhylotaR alone, so I tried the restez/phylotaR
integration. It took quite a while to download the plant database from GB, but I
apparently managed to  create the restez database (I can access with the
connect/disconnect restez commands). However, I'm stuck on what seems to
be a makeblastdb error...I get the following error message when trying to setp:

setup(wd = wd, txid = txid, ncbi_dr=ncbi_dr, v=T)
 -------------------------------------------------
phylotaR: Implementation of PhyLoTa in R [v1.2.0]
-------------------------------------------------
Checking for valid NCBI BLAST+ Tools ...
Found: [/usr/local/ncbi/blast/bin/makeblastdb]
Found: [/usr/local/ncbi/blast/bin/blastn]
. . Running makeblastdb
Error: makeblastdb failed to run. Check BLAST log files.

and the following log file in the recently created blast folder:

BLAST options error: Please provide a database name using -out

I'm using:
blast 2.9
Rstudio version 1.2.1335 with R version 3.6
Mac OS Mojave 10.14.6

Thank you so much for your time and patience.

Please let me know if you need further information.

Best regards

Alexandre 
On Tue, Jun 9, 2020 at 8:23 PM Alexandre Pedro <apselvatti@gmail.com> wrote:

Sorry, forgot to say the phylota version is 1.2.0 and restez is 1.0.2
Den ons 10 juni 2020 09:03Alexandre Pedro <apselvatti@gmail.com> skrev:

Dear Dominic,

I've solved the issue with makeblastdb (I had two conflicting versions in my R $path). But nevertheless, I'm still unable to run the PhylotaR with my restez database. I get the following message:

Error in stages_run(wd = wd, frm = 1, to = nstages, stgs_msg = stgs_msg) : 
Unexpected Error in .local(conn, statement, ...) : 
Unable to execute statement 'SELECT accession,raw_record,raw_sequence FROM nucleotide WHERE accession IN ('MN099113','MN099112','...'.
Server says 'MALException:mat.pack:HY001!Could not allocate space'.

Occurred [2020-06-10 03:52:57]
Contact package maintainer for help.

I know the Araceae family is a large group, but I can't run even the Bromeliaceae
example from the PhylotaR webpage. I have over 100Gb free in my HD; 16G of RAM
and 4 i7 CPUs.

I'm sending you the log file as well just to make sure I'm not missing anything in the setup settings.

Thank you so very much for your time and patience.

Best

Alex
On Wed, Jun 10, 2020 at 4:28 AM Dominic John Bennett <dominic.john.bennett@gmail.com> wrote:

Hi Alex,

I'm sorry you're experiencing problems.

That looks to me like an SQL type error. It's probably due to MN099113 ...
Having larger records than can be allocated in memory. It's not impossible
that the selected records combined compressed size is greater than 16gb or
whatever your OS limits the request to be.

I see two possible workarounds... 

First, you could try re-running phylotar with a much lower batch size thereby
reducing the number of records to pull out of the database each time.

Second, you could try excluding the largest records from the database by
re-running db_create with an upper sequence length.

Those are my two quick thoughts.

Good luck!

Dom
On Wed, Jun 10, 2020 at 5:45 AM Alexandre Pedro <apselvatti@gmail.com> wrote:

Dear Dom,

Please, no need to apologize, I'm really grateful that you made such a useful
program. Thanks so much for the fast reply!

It ran a little longer with batch = 50, but it returned the same error after a
while... I started a new database with max length = 4000 but i'll take a while
to finish. I'll keep you posted!

Again, thanks so much for the prompt support!

Best

Alex
On Thu, 11 Jun 2020 at 06:49, Alexandre Pedro <apselvatti@gmail.com> wrote:

Hi Dom,

unfortunately reducing the database did not solve the problem...I made
one with a max length of 50.000, and was able to reduce from 100Gb to
just 20Gb. Although I feel that this will improve my analyses downstream,
I'm still stuck with the same message,
Server says 'MALException:mat.pack:HY001!Could not allocate space'.

The thing is, I'm not sure if the problem is with my computer settings...
That message shows up right at the beginning of the download phase,
and complains about short sequences such as environmental samples
with just 600bp. But the real strange thing is that my mac does not reach
any saturation point: I keep an eye on the Activity Monitor app, and neither
my RAM or CPU gets saturated with the analysis. It feels like my R is
not using the full resource available on my computer, even though I
change the default setup to 8 cpus...RStudio does not go above 200M
in memory or 1% of CPU usage... So it doesn't look like a resource overload.

I tried several different things, as you can see in the script I have attached.
Could you please tell me if I am missing something?

Thank you for your time and patience.

P.S. I downloaded a Araceae subclade (txid<-284555#Aroideae) without restez
with no problem, but when I try the exact same thing from restez I get that error
message. I'd really like to learn how to use the restez resource, so please let me
know if there are tests we can make to get to the bottom of this issue.
On Thu, Jun 11, 2020 at 6:16 AM Dominic John Bennett <dominic.john.bennett@gmail.com> wrote:

Hmmm... I'm not sure it is your computer. We shouldn't be seeing the CPUs being used
until the BLAST stage. As for the RAM usage that indicates that we're not extracting
from the database more than you can handle. This makes sense given you've reduced
the database size and the batch size.

It's interesting that you're breaking on the same database query involving these records:
https://www.ncbi.nlm.nih.gov/nuccore/MN099113.1/ and https://www.ncbi.nlm.nih.gov/nuccore/MN099112

Are you excluding environmental sequences to try and avoid these? It looks like your
search query isn't working. I  think in your query where you have "NOT environmental"
you need to specify a field. E.g. to specify no mention of environmental in the record
title: "NOT environmental [Ti]". Are you aware that you can play around with what
parameters work here: https://www.ncbi.nlm.nih.gov/nuccore/advanced

Best,
Dom

P.S. Not sure why parameters_reset() isn't working!
On Fri, 12 Jun 2020 at 01:52, Alexandre Pedro <apselvatti@gmail.com> wrote:

Hi Dom,

I'm sorry to bother you with this, but I'd really like to understand what I could be
doing wrong.

Thanks for the suggestion, I excluded the environmental samples the correct
way. However, I'm still having issues. To rule out any restez/PhylotaR
miscommunication I might be doing, I ran just the PhylotaR on the Araceae
clade (without the restez), and after many hours it was downloaded successfully.
Nevertheless, the analysis halted again at the CLUSTER^2 stage with the error message:

Error in stages_run(wd = wd, frm = frm, to = nstages, stgs_msg = stgs_msg,  : 
    Unexpected Error in Ops.factor(blast_res[["query.id"]], blast_res[["subject.id"]]) : 
level sets of factors are different

Occurred [2020-06-11 14:03:21]
Contact package maintainer for help.

I tried a different large taxonomic group, Serpentes, to see if the problem is with
Araceae or even plants, but I get the same error. Only very small groups such as
the Aotus example work. Other lab colleagues using different computers (all Macs)
are having the same issue: we can't complete the run phase, with or without restez.
The issue seems to go away with the older version of PhylotaR (1.0), but it is much
slower. We've been following the github updates and a lot is being implemented in
version 1.2 (such as restez integration), so we would like to use the most up-to-date
version. Please let me know how we can try to understand this, and you can count
on me to run tests to sort this out.

Thanks so much

Alex
On Fri, Jun 12, 2020 at 4:36 AM Dominic John Bennett <dominic.john.bennett@gmail.com> wrote:

OK. Well at least we fixed the restez problem for now. It's probably an unexpected format
of the environmental sequences that caused the issue.

With the cluster^2 step... would it be possible for you to share your results folder?
Perhaps you compress it and upload it to a shared folder/link on google drive or
similar? Then I'd be able to investigate more properly what is going wrong with
the later version of phylotaR.

Your 95% there in terms of the pipeline run.... thanks for your persistence!

Dom
On Fri, Jun 12, 2020 at 4:50 AM Alexandre Pedro <apselvatti@gmail.com> wrote:

Hi Dom,

Gosh, thank you for your 100% patience!! After a long day I was able to download a
very large clade within Serpentes using just the PhylotaR, so it might be just as you
said, some weirdo sequences in a particular plant taxon must be causing problems
with the overall phylotaR run.

OK, I just started a fresh run for the Araceae using the same parameters just to make
sure and clean up previous frustrating tryouts, and will send you the results as soon
as I can. I've officially turned into a night owl during this pandemic, so I'm heading to
bed now at 5 am...If it goes exactly as it did earlier today, by the time I wake up it
will have finished and I'll send it right away.

Thanks so very much!!

Alex
On Fri, Jun 12, 2020 at 6:12 AM Alexandre Pedro <apselvatti@gmail.com> wrote:

Hi,

It finished pretty quickly, and at the exact same stage with the same error message.
You can download it from this link: [LINK REMOVED]

Please let me know if you have any issues with the file

Thanks!
On Fri, 12 Jun 2020 at 11:14, Alexandre Pedro <apselvatti@gmail.com> wrote:

Just to make everything easier, here's the setup line I used for the taxon 4454 (Araceae)

setup(wd=wd, txid=txid, ncbi_dr=ncbi_dr, v = TRUE, overwrite = T, btchsz = 100, ncps = 8,
db_only = F, srch_trm = "NOT predicted[TI] NOT \"whole genome shotgun\"[TI] NOT unverified[TI] NOT \"synthetic construct\"[Organism] NOT refseq[filter] NOT TSA[Keyword] NOT environmental[TI]")
On Sat, Jun 13, 2020 at 7:55 AM Dominic John Bennett <dominic.john.bennett@gmail.com> wrote:

Hi Alex,

It seemed to be quite a minor error in the end to do with factors -- not quite sure why it wasn't an issue in the older phylotar version.

You should be able to run the complete pipeline with a newly updated package from 

remotes::install_github("ropensci/phylotaR")
# after reinstalling, to be safe, I would restart the R session to make sure
# you're using the newly installed package
phylotaR::clusters2_run(wd = "[YOUR WORKING DIRECTORY]")

Thanks for your efforts, it made fixing a lot easier.

Dom
On Sat, 13 Jun 2020 at 21:59, Alexandre Pedro <apselvatti@gmail.com> wrote:

Hi Dom, 

It works perfectly. I got curious why this error did not show up in
smaller clades or even the large Serpentes clade I tried...But if it's
solved, it's solved!

I'm really happy we (you) could figure it out! I used the first phylota
browser during my MSc, but as it started to get outdated over the years
I was really sad because that idea was a major breakthrough for
phylogeneticists. So you can imagine how excited I got when your
version came out. I'm really grateful (as certainly is the entire
phylogenetics community).

Please count on me if you need anything.

Thank you so much for your patience and time. My students and lab
colleagues are also very grateful.

Best

Alex
On Sat, Jun 13, 2020 at 5:32 PM Dominic John Bennett <dominic.john.bennett@gmail.com> wrote:

Thanks Alex! I'm glad we could work it out.

It worked for the smaller clades because the smaller clades wouldn't have
needed the cluster^2 step -- clustering of clusters is only needed for the
clades where there are so many sequences/species the number of
possible combinations is too great for a single cluster step.

I was just wondering, would it be possible for me to share a redacted version
of our conversation on phylotaR's GitHub
page (https://github.com/ropensci/phylotaR/issues)? I think it's good for
others to see how potential problems can be fixed.

Best,
Dom
Oh right, that makes absolute sense.

Of course you can share it! Again, thank you for the superb support for your program(s).

Best

Alex