Search Term Quoting Issue in the Query Command

complexbrains commented 2 years ago

Dear @jeromedockes

Sorry to bother you with this silly issue, but I wonder whether something has changed in how eutils handles quotes in the query, which might be causing the error below. If I quote the query term as written in the README:

nqdc_full_pipeline ./nqdc_data -q 'fMRI[title]'

I get zero downloads, so the later processing steps cannot run, as the output below shows:

INFO    2022-06-12T21:32:25+0100        _download       Downloading data in nqdc_data\query-6f36c4b385abadbf7b58bb928d7ee4c8\articlesets
INFO    2022-06-12T21:32:25+0100        _download       Performing search-isil
DEBUG   2022-06-12T21:32:25+0100        _entrez sending request: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi
DEBUG   2022-06-12T21:32:25+0100        connectionpool  Starting new HTTPS connection (1): eutils.ncbi.nlm.nih.gov:443
DEBUG   2022-06-12T21:32:25+0100        connectionpool  https://eutils.ncbi.nlm.nih.gov:443 "POST /entrez/eutils/esearch.fcgi HTTP/1.1" 200 None
DEBUG   2022-06-12T21:32:25+0100        _entrez received response. code: 200; reason: OK; from: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi
INFO    2022-06-12T21:32:25+0100        _entrez Search returned 0 results
INFO    2022-06-12T21:32:25+0100        _entrez efetch starts
INFO    2022-06-12T21:32:25+0100        _entrez {'WebEnv': 'MCID_62a64d5816a2c0199869ec34', 'query_key': '1', 'retmax': 500, 'retstart': 0, 'db': 'pmc'}
INFO    2022-06-12T21:32:25+0100        _entrez Downloading 0 articles (in 0 batches)
INFO    2022-06-12T21:32:25+0100        _download       Finished downloading articles in nqdc_data\query-6f36c4b385abadbf7b58bb928d7ee4c8\articlesets
INFO    2022-06-12T21:32:25+0100        _download       All articles matching the query have been downloaded
INFO    2022-06-12T21:32:25+0100        _articles       Extracting articles from nqdc_data\query-6f36c4b385abadbf7b58bb928d7ee4c8\articlesets to nqdc_data\query-6f36c4b385abadbf7b58bb928d7ee4c8\articles
INFO    2022-06-12T21:32:25+0100        _articles       Extracted 0 articles from nqdc_data\query-6f36c4b385abadbf7b58bb928d7ee4c8\articlesets to nqdc_data\query-6f36c4b385abadbf7b58bb928d7ee4c8\articles
INFO    2022-06-12T21:32:25+0100        _data_extraction        Extracting data from articles in nqdc_data\query-6f36c4b385abadbf7b58bb928d7ee4c8\articles to nqdc_data\query-6f36c4b385abadbf7b58bb928d7ee4c8\subset_allArticles_extractedData
INFO    2022-06-12T21:32:26+0100        _data_extraction        Done extracting article data to csv files in nqdc_data\query-6f36c4b385abadbf7b58bb928d7ee4c8\subset_allArticles_extractedData
INFO    2022-06-12T21:32:26+0100        _vectorization  vectorizing nqdc_data\query-6f36c4b385abadbf7b58bb928d7ee4c8\subset_allArticles_extractedData using vocabulary neuroquery_data\neuroquery_model\vocabulary.csv to nqdc_data\query-6f36c4b385abadbf7b58bb928d7ee4c8\subset_allArticles-voc_e6f7a7e9c6ebc4fb81118ccabfee8bd7_vectorizedText
DEBUG   2022-06-12T21:32:26+0100        _vectorization  vectorizing articles 0 to 0 / 0
Traceback (most recent call last):
  File "C:\users\bilgi\anaconda3_new\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\users\bilgi\anaconda3_new\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\bilgi\Anaconda3_new\Scripts\nqdc_full_pipeline.exe\__main__.py", line 7, in <module>
  File "C:\users\bilgi\anaconda3_new\lib\site-packages\nqdc\_commands.py", line 258, in full_pipeline_command
    extracted_data_dir, **_voc_kwarg(args.vocabulary_file)
  File "C:\users\bilgi\anaconda3_new\lib\site-packages\nqdc\_vectorization.py", line 107, in vectorize_corpus_to_npz
    extracted_data_dir, output_dir, vocabulary_file
  File "C:\users\bilgi\anaconda3_new\lib\site-packages\nqdc\_vectorization.py", line 125, in _do_vectorize_corpus_to_npz
    extraction_result = vectorize_corpus(extracted_data_dir, vocabulary_file)
  File "C:\users\bilgi\anaconda3_new\lib\site-packages\nqdc\_vectorization.py", line 269, in vectorize_corpus
    for k, v in counts.items()
  File "C:\users\bilgi\anaconda3_new\lib\site-packages\nqdc\_vectorization.py", line 269, in <dictcomp>
    for k, v in counts.items()
  File "C:\users\bilgi\anaconda3_new\lib\site-packages\sklearn\utils\validation.py", line 72, in inner_f
    return f(**kwargs)
  File "C:\users\bilgi\anaconda3_new\lib\site-packages\sklearn\preprocessing\_data.py", line 1711, in normalize
    estimator='the normalize function', dtype=FLOAT_DTYPES)
  File "C:\users\bilgi\anaconda3_new\lib\site-packages\sklearn\utils\validation.py", line 72, in inner_f
    return f(**kwargs)
  File "C:\users\bilgi\anaconda3_new\lib\site-packages\sklearn\utils\validation.py", line 653, in check_array
    context))
ValueError: Found array with 0 sample(s) (shape=(0, 7547)) while a minimum of 1 is required by the normalize function.

However, it works fine if I pass the term without the quotes.

Similarly, the command nqdc_download -q 'fMRI[Title] AND ("2019"[PubDate] : "2019"[PubDate])' nqdc_data does not seem to work as written and gives the error below:

nqdc_download: error: unrecognized arguments: (2019[PubDate] : 2019[PubDate])' nqdc_data

However, it works well if I change it to:

nqdc_download -q "fMRI[Title] AND (2019[PubDate] : 2019[PubDate])"

These are tiny differences, but I wondered whether there is a workaround that I am missing (which is highly likely 🙄), or whether this is due to a change eutils might have made to querying.

If you think these need to be changed in the repo, given I know how busy you are, I would be more than happy to make the PRs as needed.

Thank you 🤗

jeromedockes commented 2 years ago

Hi @complexbrains,

No bother at all, thank you very much for reporting this! I think the problem is that the README example commands use single quotes (') around the query, but the Windows command line does not treat single quotes as quotes, only double quotes ("). We should make the README examples compatible with the Windows command line by using double quotes instead (of course, any double quotes in the query itself would have to be escaped, or removed if they are not necessary). In the meantime, using " instead of ' as you did should fix the problem.
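To illustrate the quoting difference described above, here is a minimal Python sketch that splits the failing command line in two ways. `shlex.split` follows POSIX shell rules. `split_windows_like` is a hypothetical, deliberately simplified helper written for this illustration: it assumes only double quotes group words and ignores backslash escapes and every other rule of the real Windows parser. It is not how Windows or nqdc actually parse arguments.

```python
import shlex


def split_windows_like(command):
    """Simplified, assumed model of Windows-style splitting.

    Only double quotes group words; single quotes are ordinary characters.
    Backslash escapes and other real-world rules are ignored on purpose.
    """
    args, current = [], []
    in_quotes, has_token = False, False
    for ch in command:
        if ch == '"':
            # A double quote toggles grouping and is dropped from the argument.
            in_quotes = not in_quotes
            has_token = True
        elif ch.isspace() and not in_quotes:
            # Unquoted whitespace ends the current argument.
            if has_token:
                args.append("".join(current))
                current, has_token = [], False
        else:
            current.append(ch)
            has_token = True
    if has_token:
        args.append("".join(current))
    return args


COMMAND = (
    "nqdc_download -q "
    "'fMRI[Title] AND (\"2019\"[PubDate] : \"2019\"[PubDate])' nqdc_data"
)

posix_args = shlex.split(COMMAND)  # POSIX shell rules: single quotes group
windows_args = split_windows_like(COMMAND)  # simplified model, see above

print("POSIX-style:  ", posix_args)
print("Windows-like: ", windows_args)
```

Under these assumptions, the POSIX-style split keeps the whole query as one argument, while the Windows-like split breaks it at every space and leaves the single quotes in place, which is consistent with the "unrecognized arguments" error reported above.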

jeromedockes commented 2 years ago

Also, I will add a check to stop more gracefully when the search returns no results. BTW, I am not sure what version you are using, but the release on PyPI is quite outdated, so for now it is best to clone the repository and use the latest commit in the main branch!
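As a rough idea of what such a check could look like, here is a minimal Python sketch. Every name in it (`NoSearchResultsError`, `check_search_results`, `main`) is hypothetical and chosen for illustration; this is not the check that was actually added to nqdc. It only shows the general pattern of failing early with a readable message instead of letting a later step crash on an empty corpus.

```python
import sys


class NoSearchResultsError(Exception):
    """Hypothetical error for a search that matched no articles."""


def check_search_results(n_results, query):
    # Hypothetical guard: stop before any later step receives empty input.
    if n_results == 0:
        raise NoSearchResultsError(
            f"Search returned 0 results for query {query!r}; "
            "check the query and how the shell quoted it."
        )
    return n_results


def main(n_results, query):
    # Return a shell-style exit code: 0 on success, 1 when nothing matched.
    try:
        check_search_results(n_results, query)
    except NoSearchResultsError as error:
        print(error, file=sys.stderr)
        return 1
    return 0
```

With a guard of this kind, a query that reaches the search with its single quotes still attached would end with a short message rather than the long scikit-learn traceback shown earlier.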

complexbrains commented 2 years ago

Thank you so much for the reply, really helps indeed.

Actually, I tried on both a MacBook and Windows 10, with both v0.0.1 and the code in the main branch, but in every case double quotes only work when the query combines a term with dates.

So no quoting only works for a single term, and double quoting only works for term + date, but not for a single term or for several terms. The lack of a consistent pattern makes it really hard to guess the reason at the moment, but I will be digging into it soon too!

I certainly love the features in the main repo! I will make sure hackathon participants download that version too so we can build upon it. I had one error when running the pipeline with the main branch, but I have to reproduce it first and will then report it too.

Thank you so so much🤗

jeromedockes commented 2 years ago

Thanks for this information; I still think this has to do with the shell's parsing of quotes and splitting of the command line rather than nqdc itself, but it would certainly help if you could provide the exact commands and corresponding errors.

In any case, writing the query in a file and using --query_file instead of --query may be easier, as in that case the query is not provided on the command line, so no quoting or escaping is necessary. You may also want to check the query in the PubMed Central website search bar.
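The query-file approach can be sketched in Python as follows. The helper names (`write_query_file`, `build_command`, `run_pipeline`) are hypothetical; only the `nqdc_full_pipeline` command, its data directory argument, and the `--query_file` option come from this thread. Passing the arguments to `subprocess.run` as a list also avoids shell parsing on that side, so the query text is never split or unquoted by a shell.

```python
import subprocess
import tempfile
from pathlib import Path

QUERY = 'fMRI[Title] AND ("2019"[PubDate] : "2019"[PubDate])'


def write_query_file(path, query):
    # Store the query exactly as typed; no shell quoting or escaping needed.
    path = Path(path)
    path.write_text(query, encoding="utf-8")
    return path


def build_command(data_dir, query_file):
    # Argument list for the pipeline; the query itself is not on the command line.
    return ["nqdc_full_pipeline", str(data_dir), "--query_file", str(query_file)]


def run_pipeline(data_dir, query_file):
    # Not called in this sketch: it needs nqdc installed and network access.
    return subprocess.run(build_command(data_dir, query_file), check=True)


# Demonstration in a temporary directory (the file name is arbitrary).
with tempfile.TemporaryDirectory() as tmp:
    query_file = write_query_file(Path(tmp) / "query_file.txt", QUERY)
    command = build_command("nqdc_data", query_file)
    stored_query = query_file.read_text(encoding="utf-8")

print(command)
```

The file keeps the double quotes of the query untouched, which is the point of the suggestion: the same file works regardless of which shell launches the command.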

Glad to hear you will use this in the (Brainhack?) hackathon. Is the project description online? I could push a release to PyPI beforehand if you think it is necessary. Let me know of any other problems you run into!

complexbrains commented 2 years ago

Oh, surely it should be something to do with the shell indeed. I do not currently have access to the MacBook I tried, but the commands that worked on Windows 10 with version 0.0.1 are

nqdc_full_pipeline -q "aphasia fMRI AND 1995[PubDate] : 2022[PubDate]" --api_key APIKEY

or

nqdc_full_pipeline -q aphasia --api_key APIKEY

I tried with the text file, but it gave me the error below:

nqdc_full_pipeline -f D:\query_file.txt ncdc_data --api_key APIKEY

whether the query word in the file is written with single quotes, double quotes, or no quotes:

ValueError: Found array with 0 sample(s) (shape=(0, 7547)) while a minimum of 1 is required by the normalize function.

Somehow it cannot read the word, I believe. At the moment I have a workaround, but I will take a closer look after Brainhack finishes.

And indeed, we are finally ready to participate in OHBM Brainhack with NeuroCausal, as Vale shared its details at our meeting. We will be using the data you provided, starting by filtering it down to the clinical studies and moving on from there. We are working on the pipeline; I will share it with you once it is finalized and would be happy to hear your thoughts on it.

Thank you so much again for your help, will update you on this issue for sure!

jeromedockes commented 2 years ago

Thanks for trying it! Could you share query_file.txt?

jeromedockes commented 2 years ago

> I tried with the text file, but it gave me the error below:
>
> nqdc_full_pipeline -f D:\query_file.txt ncdc_data --api_key APIKEY
>
> whether the query word in the file is written with single quotes, double quotes, or no quotes

I just tried on the Windows command prompt and cannot reproduce the issue. If you are still experiencing this with the current main branch version, please let me know and share the query file you used! Otherwise, I updated the README, so I believe this should be OK now.

jeromedockes commented 1 year ago

After an IRL discussion, this seems to be resolved, so closing.

neuroquery / pubget, issue #4