vdobrovolskii / wl-coref

This repository contains the code for the EMNLP 2021 paper "Word-Level Coreference Resolution".

Hello, I have a question about your paper. #32

Closed jo-kyeongbin closed 2 years ago

jo-kyeongbin commented 2 years ago

About Table 4 (memory and time) in the "Word-Level Coreference Resolution" paper.

I have a start-to-end (s2e) coreference resolution model for Korean and also a word-level coreference resolution model for Korean. However, when I ran the two models on my test data, the results were as follows.

571 documents (sequence length under 512)

| model | time | memory |
|-----------|-------|--------|
| s2e-coref | 5.4 s | 1.6 GB |
| wl-coref  | 10 s  | 2.1 GB |

GPU: RTX TITAN (24 GB); model: BERT-base

```python
# my measurement
with output_running_time():
    model.evaluate(data_split=args.data_split,
                   word_level_conll=args.word_level)
```
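For reference, here is a sketch of how the same call could also record PyTorch's own peak-memory counters (the `torch.cuda` calls are my addition and not part of wl-coref; `output_running_time`, `model`, and `args` are the same objects as above):

```python
import torch

# Reset the peak-memory counters before the run (my addition, not part of wl-coref).
torch.cuda.reset_peak_memory_stats()

with output_running_time():
    model.evaluate(data_split=args.data_split,
                   word_level_conll=args.word_level)

# Peak memory held by live tensors vs. peak memory reserved by PyTorch's allocator.
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
print(f"peak reserved:  {torch.cuda.max_memory_reserved() / 2**30:.2f} GiB")
```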

As mentioned in the paper, other factors certainly play a role. Still, I am curious how time and memory were measured to obtain the results reported in the paper; for example, did you simply time `self.run`? Also, memory accumulates as the evaluation progresses: in my case it was 1.4 GB at first, but by the end of the run it occupied 2.1 GB.

If the reason is only what the paper already mentions ("However, they should not be directly compared, as there are other factors influencing the running time, such as the choice of a framework, the degree of mention pruning, and code quality."), I'd still love to hear your own thoughts about where the difference comes from; for example, whether it is due to the separate span predictor module, to considering all spans, etc.

Thanks for your good paper!!

vdobrovolskii commented 2 years ago

Hi!

Thank you for your interest, and sorry for taking so long to reply; I was on vacation.

> I am curious how time and memory were measured to obtain the results reported in the paper; for example, did you simply time `self.run`?

The time reported in the paper was obtained from the elapsed time shown by tqdm during evaluation. Peak memory consumption was tracked via the GPU usage reported by nvidia-smi.
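If you want to reproduce that kind of measurement in code, a minimal sketch (not the script used for the paper) could poll nvidia-smi in a background thread while timing the evaluation; `model.evaluate` here is the same method as in your snippet:

```python
import subprocess
import threading
import time

def track_peak_gpu_memory(peak, stop, interval=0.5):
    """Poll nvidia-smi and keep the highest memory.used value (in MiB) seen so far."""
    while not stop.is_set():
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.used",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True,
        ).stdout
        peak[0] = max([peak[0]] + [int(x) for x in out.split()])
        time.sleep(interval)

peak, stop = [0], threading.Event()
poller = threading.Thread(target=track_peak_gpu_memory, args=(peak, stop), daemon=True)
poller.start()

start = time.perf_counter()
model.evaluate(data_split="test")  # same call as in your snippet
elapsed = time.perf_counter() - start

stop.set()
poller.join()
print(f"elapsed: {elapsed:.1f} s, peak GPU memory (nvidia-smi): {peak[0]} MiB")
```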

> Also, memory accumulates as the evaluation progresses: in my case it was 1.4 GB at first, but by the end of the run it occupied 2.1 GB.

PyTorch does not always release unused GPU memory back to the driver: its caching allocator keeps freed blocks reserved for reuse. So if it needed 2.1 GB to process one document, it might not release that memory while processing the following documents, even if they require less.
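You can see this behaviour with PyTorch's own memory counters; the snippet below is illustrative only, with a roughly 2 GiB dummy tensor standing in for a large document:

```python
import torch

x = torch.empty(512, 1024, 1024, device="cuda")  # ~2 GiB of float32
print(torch.cuda.memory_allocated() // 2**20, "MiB in use")

del x  # the tensor is freed...
print(torch.cuda.memory_allocated() // 2**20, "MiB in use")
print(torch.cuda.memory_reserved() // 2**20, "MiB still cached by the allocator")

torch.cuda.empty_cache()  # ...but only this call returns the cache to the driver
print(torch.cuda.memory_reserved() // 2**20, "MiB cached")
```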

> I'd still love to hear your own thoughts about where the difference comes from; for example, whether it is due to the separate span predictor module, to considering all spans, etc.

The two obvious differences between your measurement setup and mine are the data and the encoder. If you could run the same evaluation with a larger encoder (say, bert-large) and with the CoNLL data, I would have more input to reason about the causes of such differences.

jo-kyeongbin commented 2 years ago

Thank you for the answer!