oscar-project / corpus

corpus issues.
Apache License 2.0

How much data is common between the two OSCAR versions? #19

Open ibraheem-moosa opened 2 years ago

ibraheem-moosa commented 2 years ago

How much data is shared between the two versions? Do they overlap in time? Is the new version a superset of the earlier version?

Thanks in advance!

Uinelj commented 2 years ago

Hello and thanks for your question :)

The short answer is: We don't know.

We have reported basic word-occurrence counts in our papers (especially comparing 21.09 with the upcoming 22.01 corpus) showing that the corpus possibly retains information about events, but we haven't checked the number of duplicate documents between versions.

You may find a partial answer by looking into the overlap between CommonCrawl dumps, if any work exists on that.

@pjox Should we look into these types of stats?

pjox commented 2 years ago

I think we can do this for future versions. However, I think it will be extremely difficult to do for the 2019 and 21.09 versions, as we didn't have any document integrity for these. With the new schema (https://arxiv.org/abs/2201.06642) and the metadata, this will be much easier, and it is indeed something that we can explore in the future.
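
For anyone who wants a rough first estimate in the meantime, here is a minimal sketch of exact-duplicate counting between two versions. It is not part of any OSCAR tooling: the function names (`fingerprint`, `fingerprints`, `overlap_stats`) and the toy inputs are hypothetical, and it assumes each version can be iterated as plain-text units (whole documents where those are preserved, otherwise individual lines). It only catches exact matches after whitespace normalization, not near-duplicates.

```python
import hashlib
from typing import Dict, Iterable, Set, Union


def fingerprint(text: str) -> str:
    """Hash one text unit after collapsing runs of whitespace."""
    normalized = " ".join(text.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def fingerprints(units: Iterable[str]) -> Set[str]:
    """Collect the distinct fingerprints of all non-empty units."""
    return {fingerprint(u) for u in units if u.strip()}


def overlap_stats(
    old_units: Iterable[str], new_units: Iterable[str]
) -> Dict[str, Union[int, float, bool]]:
    """Compare two versions by the set of unit fingerprints they contain."""
    old, new = fingerprints(old_units), fingerprints(new_units)
    shared = old & new
    return {
        "shared": len(shared),
        "old_unique": len(old),
        "new_unique": len(new),
        # Share of the old version's distinct units that reappear in the new one.
        "fraction_of_old_in_new": len(shared) / len(old) if old else 0.0,
        # True only if every distinct old unit is also in the new version.
        "new_is_superset": old <= new,
    }


# Toy stand-ins for two corpus versions (hypothetical data).
old_version = ["The cat sat.", "Hello   world", "Only in old"]
new_version = ["Hello world", "The cat sat.", "Only in new", "Another"]
stats = overlap_stats(old_version, new_version)
print(stats)
```

Holding every fingerprint in an in-memory set will not scale to a full corpus; at that size one would probably need to shard by hash prefix or use an approximate structure such as a Bloom filter, which this sketch does not attempt.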