lavis-nlp / spert

PyTorch code for SpERT: Span-based Entity and Relation Transformer
MIT License

Prediction dataset #40

Closed · ChloeJKim closed this issue 3 years ago

ChloeJKim commented 3 years ago

Hi @markus-eberts

Thanks for adding a prediction mode.

I was just wondering where I can find the conll04_predictions.json file?

Thanks! Chloe

markus-eberts commented 3 years ago

Hi,

Do you mean 'conll04_prediction_example.json'? Please rerun 'bash ./scripts/fetch_datasets.sh'. The example file will then be saved to 'data/datasets/conll04/conll04_prediction_example.json'.
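In case it helps, a minimal sketch (assuming you run it from the repo root, after the fetch script above) that loads the fetched example file and prints each entry:

```python
import json

# Path as created by 'bash ./scripts/fetch_datasets.sh' (run from the repo root)
with open("data/datasets/conll04/conll04_prediction_example.json") as f:
    examples = json.load(f)

# Each entry shows one supported input format: a dict, a token list, or a raw string
for entry in examples:
    print(type(entry).__name__, "->", entry)
```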

ChloeJKim commented 3 years ago

So I see this in conll04_prediction_example.json:

[{"tokens": ["In", "1822", ",", "the", "18th", "president", "of", "the", "United", "States", ",", "Ulysses", "S.", "Grant", ",", "was", "born", "in", "Point", "Pleasant", ",", "Ohio", "."]}, ["In", "1822", ",", "the", "18th", "president", "of", "the", "United", "States", ",", "Ulysses", "S.", "Grant", ",", "was", "born", "in", "Point", "Pleasant", ",", "Ohio", "."], "In 1822, the 18th president of the United States, Ulysses S. Grant, was born in Point Pleasant, Ohio."]

Do I need to include the untokenized sentence (the plain string at the end) as well?

markus-eberts commented 3 years ago

This is just an example of the supported data formats. You have three options for specifying your sentences:

- Option 1 (mostly for compatibility with our CoNLL04/SciERC/ADE dataset format): {"tokens": ["In", "1822", ",", "the", "18th", "president", "of", "the", "United", "States", ",", "Ulysses", "S.", "Grant", ",", "was", "born", "in", "Point", "Pleasant", ",", "Ohio", "."]}
- Option 2 (in case your sentences are already tokenized): ["In", "1822", ",", "the", "18th", "president", "of", "the", "United", "States", ",", "Ulysses", "S.", "Grant", ",", "was", "born", "in", "Point", "Pleasant", ",", "Ohio", "."]
- Option 3 (in case your sentences are not tokenized): "In 1822, the 18th president of the United States, Ulysses S. Grant, was born in Point Pleasant, Ohio."

So in case your sentences are already tokenized, your input data would look as follows (see the sketch below): [["This", "is", "sentence", "1", "."], ["This", "is", "sentence", "2", "."], ["This", "is", "sentence", "3", "."], ...]
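For instance, a minimal sketch (the file name and sentences here are made up for illustration) that writes pre-tokenized sentences in the Option 2 format:

```python
import json

# Hypothetical pre-tokenized sentences (Option 2: one token list per sentence)
sentences = [
    ["This", "is", "sentence", "1", "."],
    ["This", "is", "sentence", "2", "."],
    ["This", "is", "sentence", "3", "."],
]

# Example output path -- point the prediction config at this file
with open("data/datasets/my_predict_data.json", "w") as f:
    json.dump(sentences, f)
```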

ChloeJKim commented 3 years ago

Ah, I see, so we can choose one of the three options and run the prediction.

Also, in example_predict.conf there is max_pairs = 1000. Does this refer to a maximum of 1000 sentences we can feed into the model for prediction? And can we change this number for a bigger dataset?

markus-eberts commented 3 years ago

> Also, in example_predict.conf there is max_pairs = 1000. Does this refer to a maximum of 1000 sentences we can feed into the model for prediction? And can we change this number for a bigger dataset?

This option is a bit misleading. It just restricts the number of entity pairs per sentence that are processed at once, to lower memory consumption. If you do not run into any memory (CPU or GPU) problems, just leave it at 1000. The code always processes your whole dataset either way.
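To illustrate the idea (a simplified sketch, not the actual SpERT code): the candidate entity pairs of a sentence are scored in chunks of at most max_pairs, so peak memory stays bounded while every pair is still processed:

```python
# Simplified sketch of the max_pairs idea -- not the actual SpERT implementation.
def classify_relations(pairs, classify_chunk, max_pairs=1000):
    """Score all candidate entity pairs, at most `max_pairs` at a time."""
    results = []
    for start in range(0, len(pairs), max_pairs):
        chunk = pairs[start:start + max_pairs]  # bounded batch keeps memory low
        results.extend(classify_chunk(chunk))   # every pair is still processed
    return results
```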

ChloeJKim commented 3 years ago

I see, thanks for clarifying :)
