Persistently store scraped tweets

kinshukdua / LiveActionMap

An attempt to map the areas with active conflict in Ukraine using twitter data and NLP.

https://www.live-action-map.com

MIT License

172 stars, 15 forks

Persistently store scraped tweets #23

Open laurin opened 2 years ago

laurin commented 2 years ago

As discussed in #16, the current storage of scraped tweets is not optimal: newly scraped tweets are simply appended to the existing tweets.txt file, creating a lot of duplicates. Integrating a database is probably not necessary at this point. We could store the scraped tweets with their IDs in a JSON file and only add new ones on each run of the application.
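A minimal sketch of the JSON-file idea, assuming each scraped tweet is a dict with an `id` field. The function names and the dict layout are hypothetical and not taken from the repo:

```python
import json
from pathlib import Path


def load_store(path):
    """Return the stored tweets as a dict keyed by tweet ID (empty if no file yet)."""
    path = Path(path)
    if not path.exists():
        return {}
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def add_new_tweets(store, scraped):
    """Add tweets whose ID is not in the store yet; return how many were added."""
    added = 0
    for tweet in scraped:
        tweet_id = str(tweet["id"])  # JSON object keys are always strings
        if tweet_id not in store:
            store[tweet_id] = tweet
            added += 1
    return added


def save_store(store, path):
    """Write the store back to disk as JSON."""
    with Path(path).open("w", encoding="utf-8") as f:
        json.dump(store, f, ensure_ascii=False, indent=2)
```

Keying the store by ID makes the duplicate check a dictionary lookup, so re-scraping the same tweets leaves the file unchanged.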

laurin commented 2 years ago

We should also store the time each tweet was created, and either discard tweets after a certain time or allow the user to select a time range. The latter would probably require the map to be generated client-side.

kinshukdua commented 2 years ago

I agree a JSON file is probably the best option. I don't think we should generate things client-side, because that might add unnecessary lag, especially in places where the internet might be very slow because of the current circumstances. I want to serve static HTML to keep load times as low as possible. Let's just keep the tweet discard time as a server-side parameter.
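A rough sketch of a server-side discard parameter, assuming each stored tweet carries a `created_at` field in ISO 8601 format with a UTC offset. The field name, the 48-hour default, and the function name are all assumptions made for illustration:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical default; the thread does not settle on a value.
MAX_TWEET_AGE = timedelta(hours=48)


def discard_old_tweets(store, max_age=MAX_TWEET_AGE, now=None):
    """Return a copy of the store without tweets older than max_age."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - max_age
    return {
        tweet_id: tweet
        for tweet_id, tweet in store.items()
        # Assumes timestamps like "2022-03-01T12:00:00+00:00"
        if datetime.fromisoformat(tweet["created_at"]) >= cutoff
    }
```

Running this before the static HTML is generated keeps the filtering on the server, so the client only ever receives the finished map.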

Krishna-Sivakumar commented 2 years ago

We can consider SQLite here too, since it's simple and file-based. It sounds like we're performing some conditional manipulation, and this will help us cut down on time complexity.
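For comparison, a sketch of what the SQLite variant might look like using Python's built-in `sqlite3` module. The table layout, column names, and function names are assumptions, not something the repo defines:

```python
import sqlite3


def open_db(path="tweets.db"):
    """Open (or create) the database and make sure the tweets table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tweets ("
        "id TEXT PRIMARY KEY, text TEXT NOT NULL, created_at TEXT NOT NULL)"
    )
    return conn


def insert_tweets(conn, tweets):
    """Insert tweets, skipping IDs that are already stored."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO tweets (id, text, created_at) VALUES (?, ?, ?)",
            [(str(t["id"]), t["text"], t["created_at"]) for t in tweets],
        )


def tweets_since(conn, cutoff_iso):
    """Return tweets created at or after cutoff_iso, oldest first.

    Comparing timestamps as text only works if every timestamp uses the
    same ISO 8601 format and UTC offset (an assumption of this sketch).
    """
    rows = conn.execute(
        "SELECT id, text, created_at FROM tweets "
        "WHERE created_at >= ? ORDER BY created_at",
        (cutoff_iso,),
    )
    return rows.fetchall()
```

The primary key handles deduplication and the `WHERE` clause handles the time filter, which is the kind of conditional lookup a flat file would need a full scan for.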

Krishna-Sivakumar commented 2 years ago

@DomiiBunn mentioned Firebase, which would work here.

DomiiBunn commented 2 years ago

> @DomiiBunn mentioned firebase, which would work here.

It depends on the complexity you're looking for. Firebase is a nice balance between file storage (JSON files, SQLite, etc.) and standalone databases: it's almost as flexible as the latter, and it handles security, hosting, and high availability. At the usage we'd be expecting, it should be fully free, as long as DB reads are cached.

kinshukdua commented 2 years ago

The reason I'm a little hesitant about Firebase is that it adds another step for developers looking to reproduce the repo and contribute. The simpler the project, the easier it is to contribute (as long as it doesn't impact performance or features).

DomiiBunn commented 2 years ago

Use a config file and specify

useDatabaseCache: false

That way a larger deployment can turn caching on, and a personal deployment still works fine without the added complexity.
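One way such a flag could be read, as a sketch only: it assumes a JSON config file named `config.json` (the comment above does not specify a format), and `load_config` and `pick_storage` are hypothetical helpers:

```python
import json
from pathlib import Path

# Caching stays off unless the config file turns it on.
DEFAULTS = {"useDatabaseCache": False}


def load_config(path="config.json"):
    """Return the defaults, overridden by whatever the config file sets."""
    config = dict(DEFAULTS)
    path = Path(path)
    if path.exists():
        with path.open(encoding="utf-8") as f:
            config.update(json.load(f))
    return config


def pick_storage(config):
    """Name the storage backend to use for this deployment."""
    return "database" if config["useDatabaseCache"] else "json-file"
```

With no config file present, the defaults apply and the project runs on the plain JSON file, so contributors do not need any extra setup.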

DomiiBunn commented 2 years ago

Or use Redis, but idk how painful it is to implement with Python.

And I think it would be a bit of overkill.

sahal-mulki commented 2 years ago

I am working on a fix for duplicate tweets.

Krishna-Sivakumar commented 2 years ago

Let's just go with a JSON file.

DomiiBunn commented 2 years ago

Sounds good to me

sahal-mulki commented 2 years ago

Nvm, I failed miserably at it.

DomiiBunn commented 2 years ago

I'd love to help but Python ain't my cup of tea

sahal-mulki commented 2 years ago

Sure-a-mundo