vdobrovolskii / wl-coref

This repository contains the code for the EMNLP 2021 paper "Word-Level Coreference Resolution"
MIT License

What's the meaning of sent_id and part_id in this dataset? #5

Closed: wbqhb closed this issue 2 years ago

wbqhb commented 2 years ago

What's the meaning of sent_id and part_id in this dataset?

vdobrovolskii commented 2 years ago

Hi!

Part_id is used for debugging purposes, since there can be multiple documents with the same name but different part_id. Also, the expected CoNLL output format contains a part id column, so this information has to be preserved.
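For reference, in the CoNLL-2012 format the part number appears both in the document header line and as the second column of each token line, roughly like this (token columns abbreviated):

```
#begin document (bc/cctv/00/cctv_0000); part 000
bc/cctv/00/cctv_0000   0   0   In   IN   ...
...
#end document
```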

Sent_id is used in two places:

  1. When splitting the document into windows of at most 512 subtokens, we don't want to split in the middle of a sentence (see the sketch after this list).
  2. When predicting spans from head words, we only consider possible boundaries within the same sentence.
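
For intuition, here is a minimal sketch (not the repository's actual code) of this kind of boundary-respecting splitting, assuming each subtoken carries the sent_id of the sentence it belongs to:

```python
def split_into_windows(sent_ids, max_len=512):
    """Greedily cut a document into windows of at most max_len subtokens,
    moving each cut back to the nearest sentence boundary."""
    windows = []
    start = 0
    while start < len(sent_ids):
        end = min(start + max_len, len(sent_ids))
        if end < len(sent_ids):
            # back off until the cut falls on a sentence boundary
            while end > start and sent_ids[end] == sent_ids[end - 1]:
                end -= 1
            if end == start:
                # a single sentence longer than max_len: fall back to a hard cut
                end = start + max_len
        windows.append((start, end))
        start = end
    return windows

# e.g. three sentences of lengths 3, 4 and 2, with max_len=5:
print(split_into_windows([0, 0, 0, 1, 1, 1, 1, 2, 2], max_len=5))
# -> [(0, 3), (3, 7), (7, 9)]
```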
wbqhb commented 2 years ago

Thank you for your reply.

wbqhb commented 2 years ago

Another question: I am a novice at PyTorch. My single GPU's memory is not enough. How can I run this code in parallel on multiple GPUs?

vdobrovolskii commented 2 years ago

Do you have memory issues when training or when evaluating? How much GPU memory do you have?

I'm asking because there's no simple solution for running the code on multiple GPUs, but depending on your needs, I could suggest some approaches that might help.

wbqhb commented 2 years ago

I want to train your model with some ideas of my own added. I have ten GPUs with 16 GB of memory each.

[image: screenshot of a PyTorch function]

Actually, I don't know where to add this in your code.

vdobrovolskii commented 2 years ago

I would try wrapping model.bert and model.a_scorer in this function. They are the most memory-intensive modules during training.
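A minimal sketch, assuming the function in the screenshot above is torch.nn.DataParallel (the attribute names model.bert and model.a_scorer come from this thread; the rest is illustrative):

```python
import torch
from torch import nn

# `model` is assumed to be the already-constructed coreference model.
# nn.DataParallel replicates a module on each visible GPU and splits
# its input along dim 0, so this only helps if the module's inputs
# are batched along that dimension.
device_ids = list(range(torch.cuda.device_count()))
model.bert = nn.DataParallel(model.bert, device_ids=device_ids)
model.a_scorer = nn.DataParallel(model.a_scorer, device_ids=device_ids)
```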

Alternatively, you can place them on different devices (using the .to() method).
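For example, a hypothetical model-parallel placement (the tensor names and the forward-pass calls are illustrative, not the repository's actual code):

```python
import torch

# Keep the encoder on one GPU and the scorer on another; activations
# then have to be moved to the scorer's device by hand.
model.bert.to("cuda:0")
model.a_scorer.to("cuda:1")

hidden = model.bert(input_ids.to("cuda:0")).last_hidden_state
scores = model.a_scorer(hidden.to("cuda:1"))
```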

Another way would be to try using LMS (Large Model Support), which swaps inactive tensors out to host memory.

wbqhb commented 2 years ago

Thank you for your help. I will have a try.