princeton-nlp / DensePhrases

[ACL 2021] Learning Dense Representations of Phrases at Scale; EMNLP'2021: Phrase Retrieval Learns Passage Retrieval, Too https://arxiv.org/abs/2012.12624
Apache License 2.0

Recipe to build dense representations from corpus #34

Open vabatista opened 1 year ago

vabatista commented 1 year ago

Hi,

I'm trying to create dense representations from my corpus and search paragraphs/phrases by keyword or question. I don't have labeled questions and answers, and for now I don't need to extract answers — I just want to retrieve documents that possibly contain the answer.

I built a JSON file with my corpus (pt-BR) like this:

{
    "data": [
        {
            "title": "Radicais livres: o que são, efeitos no corpo e como se proteger",
            "paragraphs": [
                {
                    "context": "Os radicais livres ...""
                },
                {
                    "context": "Desta forma, quanto menos radicais livres, ..."
                }, ...
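For reference, a minimal sketch of a script that writes a corpus file in this shape (the "data"/"paragraphs"/"context" structure is taken from the snippet above; the output filename is just an example):

```python
import json

# Build a SQuAD-style corpus file: a top-level "data" list of documents,
# each with a "title" and a list of "paragraphs" holding only "context".
corpus = {
    "data": [
        {
            "title": "Radicais livres: o que são, efeitos no corpo e como se proteger",
            "paragraphs": [
                {"context": "Os radicais livres ..."},
                {"context": "Desta forma, quanto menos radicais livres, ..."},
            ],
        }
    ]
}

# ensure_ascii=False keeps the pt-BR accented characters readable in the file.
with open("all_data.json", "w", encoding="utf-8") as f:
    json.dump(corpus, f, ensure_ascii=False, indent=4)
```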

then I ran the following commands:

python generate_phrase_vecs.py \
    --pretrained_name_or_path SpanBERT/spanbert-base-cased \
    --data_dir ./data \
    --cache_dir ./cache \
    --test_file ../tua-saude/all_data.json \
    --do_dump \
    --max_seq_length 512 \
    --fp16 \
    --filter_threshold -2.0 \
    --append_title \
    --output_dir ./data/densephrases-multi_sample \
    --load_dir princeton-nlp/densephrases-multi

python build_phrase_index.py \
    --dump_dir ./data/densephrases-multi_sample/dump \
    --stage all \
    --replace \
    --num_clusters 128 \
    --fine_quant OPQ96 \
    --doc_sample_ratio 0.3 \
    --vec_sample_ratio 0.3 \
    --cuda
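Judging by these flags and the index_name used further down, the on-disk index directory name appears to combine num_clusters and fine_quant; a hedged sketch of that assumption:

```python
# Assumption (not confirmed by the docs here): build_phrase_index.py names
# the index directory from num_clusters and fine_quant, under "start/".
num_clusters = 128
fine_quant = "OPQ96"
index_name = f"start/{num_clusters}_flat_{fine_quant}"
print(index_name)  # start/128_flat_OPQ96
```

This matches the `index_name='start/128_flat_OPQ96'` passed to DensePhrases below.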

python scripts/preprocess/compress_metadata.py \
    --input_dump_dir ./data/densephrases-multi_sample/dump/phrase \
    --output_dir ./data/densephrases-multi_sample/dump

Those commands seem to work fine. Here are the contents of output_dir (screenshot attached):

Now, when I try to use the model:

model = DensePhrases(
     load_dir='princeton-nlp/densephrases-multi',
     dump_dir='./data/densephrases-multi_sample/dump/',
     index_name='start/128_flat_OPQ96'
)

I get this error:

>>> 
This could take up to 15 mins depending on the file reading speed of HDD/SSD
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/projetos/u4vn/DensePhrases/densephrases/model.py", line 52, in __init__
    self.truecase = TrueCaser(os.path.join(os.environ['DATA_DIR'], self.args.truecase_path))
  File "/projetos/u4vn/DensePhrases/densephrases/utils/data_utils.py", line 366, in __init__
    with open(dist_file_path, "rb") as distributions_file:
FileNotFoundError: [Errno 2] No such file or directory: './data/truecase/english_with_questions.dist'

What am I missing? What file is this?

Jenny-Jo commented 1 year ago

I guess it's because of the $DATA_DIR environment variable. This kind of error was raised for me when config.sh didn't run properly. Whenever it happens, I just run config.sh again, and everything works fine.
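A sketch of the same check in Python, based on the traceback above (model.py reads `os.environ['DATA_DIR']` and joins it with a truecase path). The fallback path "./data" is an assumption; config.sh normally exports the correct value:

```python
import os

# The traceback shows DensePhrases joins DATA_DIR with a truecase path,
# so DATA_DIR must be set before constructing the model.
data_dir = os.environ.get("DATA_DIR")
if data_dir is None:
    # Assumed fallback: config.sh normally exports DATA_DIR; set it
    # manually only if sourcing config.sh is not an option.
    os.environ["DATA_DIR"] = "./data"
    data_dir = "./data"

# Path taken from the FileNotFoundError in the traceback.
truecase_file = os.path.join(data_dir, "truecase", "english_with_questions.dist")
print("truecase file present:", os.path.exists(truecase_file))
```

If the file is missing even with DATA_DIR set, the truecase data likely still needs to be downloaded into $DATA_DIR.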