Closed jo-kyeongbin closed 2 years ago
Hi!
Thank you for your interest, and sorry for taking so long to reply — I was on vacation.
> However, I am curious how time and memory were measured to obtain the good results reported in the paper. For example, I am curious about your measurement method, such as whether you simply timed `self.run`.
The times reported in the paper were obtained from the elapsed time shown by tqdm during evaluation. Peak memory consumption was tracked via the GPU usage reported by nvidia-smi.
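Sketched concretely, that kind of timing amounts to a plain wall-clock measurement around the evaluation loop. In this minimal sketch, `run_fn` and `docs` are hypothetical stand-ins for the model's per-document evaluation call and the dataset:

```python
import time

def measure_runtime(run_fn, docs):
    """Wall-clock time for one pass over the evaluation set.

    run_fn and docs are hypothetical stand-ins for the model's
    per-document evaluation call and the evaluation dataset.
    """
    start = time.perf_counter()
    for doc in docs:
        run_fn(doc)
    return time.perf_counter() - start

# Peak GPU memory can be sampled externally while this runs, e.g.:
#   watch -n 0.5 nvidia-smi --query-gpu=memory.used --format=csv
# or from inside PyTorch (if CUDA is available):
#   torch.cuda.reset_peak_memory_stats()
#   ... run evaluation ...
#   torch.cuda.max_memory_allocated() / 2**20  # MiB
```

tqdm's elapsed time and this measurement should agree closely, since both just wrap the same loop.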
> Also, when evaluating, memory accumulates as the evaluation progresses. In my case, 1.4 GB was recorded at first, but by the end it occupied 2.1 GB.
PyTorch does not always free unused memory, so if it needed 2.1 GB to process one document, it may not release that memory for the following documents, even if their memory consumption is lower.
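As a toy illustration (the numbers are hypothetical, and this is not a model of PyTorch's real caching allocator), the usage nvidia-smi reports tends to track the running maximum of per-document needs rather than the current need:

```python
def peak_reserved(doc_memory_needs):
    """Toy model of the caching-allocator effect: reported GPU usage
    follows the running maximum of per-document memory needs (in MiB),
    because freed blocks stay cached rather than being returned to the
    driver. Input numbers are illustrative, not measured.
    """
    reserved = 0
    trace = []
    for need in doc_memory_needs:
        reserved = max(reserved, need)  # cached blocks are kept, not released
        trace.append(reserved)
    return trace

# peak_reserved([1400, 900, 2100, 1200]) -> [1400, 1400, 2100, 2100]
```

This matches the pattern you describe: once one document pushes usage to 2.1 GB, the reported figure stays there for the rest of the run. Inside PyTorch, `torch.cuda.memory_allocated()` (live tensors) versus `torch.cuda.memory_reserved()` (cached blocks) makes the gap visible.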
> I'd love to hear your own thoughts about the different results — for example, the effect of using a separate span-predictor module, or of considering all spans.
The two obvious differences between your measurement setup and mine are the data and the encoder. If you could run the same evaluation using a larger encoder (say, bert-large) and the CoNLL data, I would have more input to reason about the causes of these differences.
Thank you for the answer!
About the word-level coreference resolution paper's Table 4 (memory and time)
I have a Korean start-to-end coreference resolution model and also a Korean word-level coreference resolution model, but when I ran the models on my test data, the results were as follows.
GPU: RTX TITAN 24 GB
Model: BERT-base
As mentioned in the paper, other factors presumably play a role. However, I am curious how time and memory were measured to obtain the good results reported in the paper. For example, I am curious about your measurement method, such as whether you simply timed `self.run`. Also, when evaluating, memory accumulates as the evaluation progresses. In my case, 1.4 GB was recorded at first, but by the end it occupied 2.1 GB.
Is the reason what the paper itself mentions? "However, they should not be directly compared, as there are other factors influencing the running time, such as the choice of a framework, the degree of mention pruning, and code quality."
I'd love to hear your own thoughts about the different results — for example, the effect of using a separate span-predictor module, or of considering all spans.
Thanks for your good paper!!