hltcoe / HC3

HLTCOE CLIR Conversation Collection

Question about HC3 dataset and Twitter API Access #1

Closed jonghwi-kim closed 1 week ago

jonghwi-kim commented 2 weeks ago

I’m a researcher interested in CLIR and greatly appreciated your published paper. I have a question about the HC3 dataset. From what I understand, accessing it might require using the Twitter API, potentially needing a Basic or Pro API account under X API v2.

Could you kindly confirm if this is correct? If so, could you please advise on the minimum level of API account necessary to assemble the document collection?

Thank you for your help.

Best regards, Jonghwi Kim

dlawrie commented 1 week ago

We were able to build the collection with a research account. It is unclear whether that still provides sufficient access. If you are eligible, I would first request a research account or investigate whether it allows post pulls. If it does not provide access or you are unable to get a research account, I would start by harvesting tweets from the Internet Archive's Twitter Stream Grab. We did not use this route, so I cannot say what proportion of the collection is available from this source; however, it covers the same timeframe as the collection. I recommend this because if you need an X account for the entire collection, I think the Basic account will be too rate limited to build the collection in a reasonable amount of time.

jonghwi-kim commented 1 week ago

Thank you for your quick response. I have a few follow-up questions.

Unfortunately, from what I know, the Twitter API policy changed in 2023, making the Research Account unavailable. Therefore, I’ll need to rely on the Internet Archive's Twitter Stream Grab to retrieve the tweet content corresponding to the tweet IDs.
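A minimal sketch of that ID-matching step, under stated assumptions: the archive files are taken to be JSON-lines (optionally bz2-compressed), with one tweet object per line carrying an `id_str` or `id` field. The function names `read_lines` and `match_tweets` are hypothetical and are not part of the HC3 repository.

```python
import bz2
import json
from typing import Dict, Iterable, Iterator, Set


def read_lines(path: str) -> Iterator[str]:
    """Yield text lines from a plain or bz2-compressed JSON-lines file."""
    opener = bz2.open if path.endswith(".bz2") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        yield from handle


def match_tweets(lines: Iterable[str], wanted_ids: Set[str]) -> Dict[str, dict]:
    """Return the tweets whose ID is in wanted_ids, keyed by ID string.

    Assumes each line is one JSON object; lines that are blank, malformed,
    or lack an ID (for example stream delete notices) are skipped.
    """
    found: Dict[str, dict] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        tweet_id = record.get("id_str") or str(record.get("id", ""))
        if tweet_id in wanted_ids:
            found[tweet_id] = record
    return found
```

Calling `match_tweets(read_lines(path), wanted_ids)` once per archive file and merging the results gives the set of collection IDs recovered from the archive; the IDs left over are the ones that would need another source.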

Considering that the source you used for HC3 differs from the Internet Archive's source, could this mean that some tweets might not be found? If so, would you expect this to result in slight differences from the statistics reported in the paper?

Additionally, in section 3.1 Document Creation, I noticed that most tweets were based on the Internet Archive, with additional ones obtained through a live search of Twitter. Could you kindly clarify if "live search" refers to directly searching on the web rather than using the API?

Thank you again for your help.

dlawrie commented 1 week ago

I think that you will likely need to extend beyond the Internet Archive to make the collection usable. It is unfortunate that the research account is unavailable. As part of the annotation process, annotators accessed Twitter and performed searches in 2021 and perhaps 2022. These were the initial relevant documents for each topic. In a second phase, annotators judged other conversations that were part of the Internet Archive. The Twitter API was used to pull Tweets identified by annotators as well as Tweets in a conversation chain. I do not have counts of the number of Tweets that are not part of the Internet Archive, but Tweets not part of the Internet Archive are more likely to have judgments because of the way the collection was created.

We did anticipate that you would not have precisely the same collection that we scored, so we made our runfiles publicly available and provide a script that will score our runs based on the collection that you are able to create. You will therefore be able to compare your approach to our baseline approaches. We also urge you to make your runfiles available so that future users of the collection can use the same approach to evaluate their systems.

As an aside, we provided this script to account for messages becoming unavailable, not for the inability to pay for access. The implication of this assumption is that we thought the collection would become strictly smaller over time. If you decide to evaluate only on the Internet Archive subset of the collection, it would be helpful for you to post the IDs that are part of that subset. Another risk of that approach is that it will be harder to statistically distinguish systems because there will be fewer judgments. You may even lose topics, since you won't want to evaluate over topics for which you have no relevant documents. I recommend that you check for this condition once you create your version of the collection.
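The check recommended above could look roughly like the following. This is a sketch only: it assumes the judgments are in the common TREC four-column qrels layout (`topic iteration doc_id relevance`) and that any relevance value above 0 counts as relevant; the actual HC3 files and grading scale may differ. The names `parse_qrels` and `restrict_qrels` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Qrels = Dict[str, Dict[str, int]]


def parse_qrels(lines: Iterable[str]) -> Qrels:
    """Parse TREC-style qrels lines (assumed format) into topic -> doc -> grade."""
    qrels: Dict[str, Dict[str, int]] = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic, _iteration, doc_id, relevance = parts
        qrels[topic][doc_id] = int(relevance)
    return dict(qrels)


def restrict_qrels(qrels: Qrels, available_ids: Set[str]) -> Tuple[Qrels, List[str]]:
    """Keep only judgments for documents you hold; report topics left with
    no relevant document (assumption: relevance > 0 means relevant)."""
    filtered: Qrels = {}
    lost_topics: List[str] = []
    for topic, judgments in qrels.items():
        kept = {doc: rel for doc, rel in judgments.items() if doc in available_ids}
        filtered[topic] = kept
        if not any(rel > 0 for rel in kept.values()):
            lost_topics.append(topic)
    return filtered, sorted(lost_topics)
```

Topics returned in the second element are the ones to drop from evaluation, and the document IDs remaining in the first element are the subset worth posting for future users.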

jonghwi-kim commented 1 week ago

Thank you so much for your quick and thoughtful response. It has cleared up many of my questions, and I truly appreciate your help.

I will proceed by using the Twitter data from the Internet Archive as much as possible to build the collection and check its coverage. Should I end up using this data, I will make sure to share the IDs of the subset and the runfiles I used for future users.

Once again, thank you for your kind and helpful response.