gwu-libraries / TweetSets

Service for creating Twitter datasets for research and archiving.
MIT License

tweetset_loader tweet count is misleading #169

Open kerchner opened 2 years ago

kerchner commented 2 years ago

tweetset_loader looks at every file in the folder, simply counts the lines in each one, and logs a message to the console such as:

INFO:__main__:Counting tweets in 34 files.
INFO:__main__:191,631 total tweets

Following our documentation for loading into TweetSets creates additional files in the folder that should not be counted, such as files containing the concatenated contents of all the tweet ID files. As a result, tweetset_loader counts lines in more files than it should and reports a wildly inaccurate tweet count.
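To illustrate the effect, here is a minimal sketch, not the actual tweetset_loader code. The file names, the `*.jsonl` pattern, and the helpers `count_lines` and `demo` are hypothetical and exist only for this example. It compares counting lines in every file in a folder with counting lines only in files that match a tweet-file pattern.

```python
import tempfile
from pathlib import Path


def count_lines(paths):
    """Count lines across the given files (hypothetical helper)."""
    total = 0
    for path in paths:
        with open(path, "rb") as f:
            total += sum(1 for _ in f)
    return total


def demo():
    """Build a throwaway folder and return (naive_count, filtered_count)."""
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        # Two hypothetical tweet files: 2 + 3 = 5 tweets.
        (folder / "tweets-001.jsonl").write_text("{}\n{}\n")
        (folder / "tweets-002.jsonl").write_text("{}\n{}\n{}\n")
        # A hypothetical by-product file of concatenated tweet IDs: 5 lines.
        (folder / "all-ids.txt").write_text("1\n2\n3\n4\n5\n")

        # Naive: every file in the folder contributes to the count.
        naive = count_lines(p for p in folder.iterdir() if p.is_file())
        # Filtered: only files matching the assumed tweet-file pattern.
        filtered = count_lines(folder.glob("*.jsonl"))
        return naive, filtered
```

In this toy folder the naive count is double the filtered one, because the by-product file contributes one line per tweet ID on top of the tweets themselves.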

Relevant code is here: https://github.com/gwu-libraries/TweetSets/blob/master/tweetset_loader.py#L319-L322

Since this is a back-end function, I would suggest simply making the message less specific rather than spending effort on making the count more precise. This would at least avoid giving the person invoking the load the impression that something has gone wrong.
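One possible shape for a less specific message is sketched below. The wording and the helper `describe_count` are assumptions for illustration only and are not taken from tweetset_loader.py.

```python
import logging

log = logging.getLogger(__name__)


def describe_count(file_count, line_count):
    """Build a message that reports lines rather than claiming a tweet total.

    Hypothetical helper: it only formats the two numbers it is given.
    """
    return (
        "Found {:,} files containing {:,} lines; "
        "the number of tweets loaded may be lower.".format(file_count, line_count)
    )


def log_count(file_count, line_count):
    """Log the hedged message at INFO level."""
    log.info(describe_count(file_count, line_count))
```

Using the numbers from the console output above, `describe_count(34, 191631)` reports 34 files and 191,631 lines without asserting that every line is a tweet.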