Closed Chacha-Chen closed 4 years ago
Hello,
The code you ran uses our own data files, which we couldn't share directly due to Twitter's policies.
Could you use the version under the shared_task
folder? I provide a cleaned version for the LR baseline there, but you will have to prepare the data files yourself.
Thanks, Shi
Hi Shi,
Thanks for the quick reply.
Could you provide more instructions on how to prepare my own data files?
It seems that the xxx.pkl files are generated by data_preprocessing.py from xxx.jsonl files.
I am confused about where I need to download my own files from.
Thanks.
Hello,
I have a README.md file here: https://github.com/viczong/extract_COVID19_events_from_Twitter/tree/master/shared_task.
Thanks,
Hi Shi,
Is data_processing.py provided as a script to generate the data files for LR and BERT by converting the provided jsonl files?
I checked your README.md. Sorry, but I am confused about the required input jsonl file for data_processing.py, given your provided annotated data and my downloaded tweets.
For example, the provided xxx.jsonl files do not have the keys consensus_annotation
and candidate_chunks_with_id.
Thanks.
Hello Chacha,
Currently, to run the baseline model for the shared task, you only need the files under the shared_task
folder (i.e., you don't need data_processing.py from the model
folder). If you just format your data as in https://github.com/viczong/extract_COVID19_events_from_Twitter/tree/master/shared_task, you will not need the consensus_annotation
and candidate_chunks_with_id
fields. Each instance looks like this:
```python
[('Tom Hanks and his wife have both tested positive for the Coronavirus.',
  'Tom Hanks',
  '<Q_TARGET> and his wife have both tested positive for the Coronavirus .',
  ['Tom Hanks', 'his wife'],
  1)]
```
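As an illustrative sketch (not the repo's actual code), one way to build an instance in that (text, target chunk, masked text, candidate chunks, label) format is to replace the target chunk's tokens with the `<Q_TARGET>` placeholder; the helper name and the space-joined tokenization are assumptions:

```python
# Hypothetical helper (not from the repo): build one baseline instance in the
# (text, target_chunk, masked_text, candidate_chunks, label) format.
def build_instance(text, tokens, chunk, candidate_chunks, label):
    chunk_tokens = chunk.split()
    masked = []
    i = 0
    while i < len(tokens):
        # Replace the target chunk's token span with the placeholder.
        if tokens[i:i + len(chunk_tokens)] == chunk_tokens:
            masked.append("<Q_TARGET>")
            i += len(chunk_tokens)
        else:
            masked.append(tokens[i])
            i += 1
    return (text, chunk, " ".join(masked), candidate_chunks, label)

text = "Tom Hanks and his wife have both tested positive for the Coronavirus."
tokens = ["Tom", "Hanks", "and", "his", "wife", "have", "both",
          "tested", "positive", "for", "the", "Coronavirus", "."]
print(build_instance(text, tokens, "Tom Hanks", ["Tom Hanks", "his wife"], 1))
```

This reproduces the example instance above for the "Tom Hanks" target chunk.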
We don't provide a script for data pre-processing for the shared task.
(Yes, data_processing.py is what we use to generate the data files for LR and BERT, but it is written to handle our own data files. We haven't had time to fully clean up the code to deal with different input formats.)
I hope it helps.
Thanks, Shi
Got it. Thanks.
Hi Shi,
For the submission, do we need to provide part1.response as well, or are only the part2.xxx.response files needed?
Thanks.
Hi Chacha,
I think our current plan is to evaluate on slot filling questions (part2.xxx.response).
Thanks, Shi
Hi Shi, thanks for the quick reply. One quick follow-up question.
The final model takes the full text as input and outputs the corresponding part2.responses. Am I understanding correctly? Will the part1.responses be provided alongside the full text?
Thanks.
Hi Chacha,
Yes, you are correct. I think we will provide the full text along with the candidate choices, and the model then makes predictions for those slot filling questions by selecting from the provided candidate choices. I don't think we will provide the part1.responses.
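To make the selection setup concrete, here is a toy sketch (not the repo's model): slot filling reduces to scoring each provided candidate chunk and keeping those above a threshold. The function name, scorer, and threshold are all assumptions for illustration.

```python
# Hypothetical illustration: slot filling as candidate selection.
# score_fn rates each candidate chunk; chunks above the threshold are predicted.
def predict_slot(candidate_chunks, score_fn, threshold=0.5):
    return [c for c in candidate_chunks if score_fn(c) >= threshold]

# Toy scorer for demonstration only (a real model would produce these scores).
scores = {"Tom Hanks": 0.9, "his wife": 0.8, "the Coronavirus": 0.1}
print(predict_slot(list(scores), scores.get))
```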
Thanks, Shi
Hi,
I was trying to reproduce the baseline results and hit these errors:

```
FileNotFoundError: [Errno 2] No such file or directory: '/data/zong/scraper_covid-MERGE/annotation/positive-FINAL.jsonl'
FileNotFoundError: [Errno 2] No such file or directory: 'data/test_positive.pkl'
```

The paths come from this dictionary in the code:

```python
task_type_to_datapath_dict = {
    "tested_positive": ("/data/zong/scraper_covid-MERGE/annotation/positive-FINAL.jsonl", "data/test_positive.pkl"),
    "tested_negative": ("/data/zong/scraper_covid-MERGE/annotation/negative-FINAL.jsonl", "data/test_negative.pkl"),
    "can_not_test": ("/data/zong/scraper_covid-MERGE/annotation/can_not_test-FINAL.jsonl", "data/can_not_test.pkl"),
    "death": ("/data/zong/scraper_covid-MERGE/annotation/death-FINAL.jsonl", "data/death.pkl"),
    "cure": ("/data/zong/scraper_covid-MERGE/annotation/cure-FINAL.jsonl", "data/cure.pkl"),
}
```
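(Note from outside the thread: the `/data/zong/...` paths are hard-coded to the authors' machine. If you do want to run this code on your own data, a plausible fix, with placeholder paths that are assumptions, not real files, is to repoint each entry at a local annotated jsonl and an output pickle path.)

```python
# Hypothetical local override (placeholder paths, not real files):
# each entry maps a task to (input_annotated_jsonl, output_pickle).
task_type_to_datapath_dict = {
    "tested_positive": ("my_data/positive.jsonl", "data/test_positive.pkl"),
    "death": ("my_data/death.jsonl", "data/death.pkl"),
}
```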
Sorry, I might not understand what is going on or what those files are. Could you please give more instructions on how to get those files, or descriptions of them?
Thanks.