stephanegigandet opened this issue 6 years ago (status: Open)

It would be very interesting to store the number of scans we get for each product (at least for the official OFF apps) so that we can use it to sort by popularity, to see the "life and death" of products, seasonality etc.
Considering that the number of scans is not part of the product, I imagine the scans should probably not be stored in the product.sto file or the products collection, but rather in a different collection. The reason is that this would produce a lot of writes to the .sto file, and storing it in a different collection would enable us to implement #164 by storing non-successful scans, too. Additionally, #291 could be implemented by storing some (GDPR compliant) location of the scan along with the originating software, without exposing that data through the MongoDB dump.
Agreed, I think it makes sense to use a different collection for scans. We could update the product collection with aggregate scan data nightly (e.g. a sort key for product popularity). To store the scans themselves, we could also extract them nightly from the nginx logs, possibly with some preprocessing to filter out some scans (e.g. if one individual makes too many scans) and aggregation (e.g. by country and product).
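For illustration, a rough sketch of what such a nightly aggregation step could look like (the input format and the per-user threshold are assumptions, not an agreed design):

```python
from collections import Counter

# Assumed input: (user_or_ip, country, barcode) tuples already extracted
# from the nginx logs by an earlier parsing step.
MAX_SCANS_PER_USER_PER_DAY = 500  # illustrative threshold, not a real limit

def aggregate_scans(scans):
    """Count scans per (country, barcode), dropping over-active scanners."""
    scans = list(scans)
    per_user = Counter(user for user, _, _ in scans)
    totals = Counter()
    for user, country, barcode in scans:
        if per_user[user] > MAX_SCANS_PER_USER_PER_DAY:
            continue  # filter out individuals who scan suspiciously often
        totals[(country, barcode)] += 1
    return totals
```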
I was thinking more about not only scans, but all kinds of user events:
I think it would be useful to store all those events for each user (unless they opt out), including not-logged-in users, so that we can not only derive scan statistics, but also create useful user features (e.g. a scan history available both on the phone and on the web site) and gamification (points, monthly lists of contributors, etc.).
I'm thinking that it might make more sense to use a SQL database for those events, with columns like: timestamp, user id (or anonymized user id if not logged in), barcode, type of event (scan, edit, image upload), event status code (e.g. scan successful or not).
Then from that table of events, we could compute aggregates etc. Having it in a SQL database will also reduce the risk of making MongoDB grow too much, and would allow us to move the database to other servers etc. later if needed.
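As a rough illustration of that idea, here is a minimal sketch of such an events table in Postgres, accessed with psycopg2 (all names, types, and status codes are assumptions, not an agreed schema):

```python
import psycopg2

# Hypothetical schema matching the columns described above.
DDL = """
CREATE TABLE IF NOT EXISTS events (
    event_time  timestamptz NOT NULL DEFAULT now(),
    user_id     text        NOT NULL,  -- or an anonymized id for not-logged-in users
    barcode     text        NOT NULL,
    event_type  text        NOT NULL,  -- 'scan', 'edit', 'image_upload', ...
    status      smallint    NOT NULL   -- e.g. 0 = successful, 1 = product not found
);
"""

# Example aggregate: most scanned products over the last 30 days.
TOP_SCANNED = """
SELECT barcode, count(*) AS scans
FROM events
WHERE event_type = 'scan'
  AND event_time > now() - interval '30 days'
GROUP BY barcode
ORDER BY scans DESC
LIMIT 100;
"""

with psycopg2.connect("dbname=off") as conn, conn.cursor() as cur:
    cur.execute(DDL)
    cur.execute(TOP_SCANNED)
    top_products = cur.fetchall()  # [(barcode, scans), ...]
```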
We would need to do some rough sizing estimates (e.g. can we store 1M events a day in a table? can a single table have 1 billion lines?)
What do you think?
Instead of a typical SQL database, we could also test some time series databases like OpenTSDB, ClickHouse, or InfluxDB. Their main advantage is that they're purpose-built for ingesting large amounts of time-based data points. If we want to store the events in order to be able to create time-based stats, storing this outside of the normal Mongo collection or the .sto files sounds like a good idea.
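For example, ingesting a scan as a time-based data point into InfluxDB 2.x with its official Python client could look roughly like this (URL, token, org, and bucket names are placeholders):

```python
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

# Placeholder connection details for illustration only.
client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="off")
write_api = client.write_api(write_options=SYNCHRONOUS)

# One scan event: tags are indexed dimensions, fields hold the values.
point = (
    Point("product_scan")
    .tag("country", "fr")
    .tag("barcode", "737628064502")
    .field("count", 1)
)
write_api.write(bucket="events", record=point)
```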
Sure, there is some interesting innovation in time series databases, but I wouldn't choose a database that is too "esoteric".
We don't have many human resources, and I think we need proven solutions that are widely used and built to last. That's why I would try good old PostgreSQL.
See: https://medium.com/@neslinesli93/how-to-efficiently-store-and-query-time-series-data-90313ff0ec20
I don't know if systems like InfluxDB are esoteric or just purpose-built for this kind of application. After all, they're used in production by Cisco, eBay, IBM, and others (according to https://www.influxdata.com/). In fact, InfluxDB outranks the other time series engines on popularity rankings.
> We don't have many human resources, and I think we need proven solutions that are widely used and built to last. That's why I would try good old PostgreSQL.
I've never developed anything on PostgreSQL; in fact, the last time I developed anything "significant" on an open source RDBMS might have been MySQL about 10 years ago, and compared to MS SQL Server, which I've been using professionally for the past few years, that was not a pleasant experience. Based on that, I'll abstain from voting.
If I may: I don't think an RDBMS is a good idea.
It seems to me that most event-driven architectures use NoSQL databases with file systems (HDFS mostly). It depends what you want to do with it, but you could use C* (Cassandra; key/value), Redis (key/value in RAM), Elasticsearch or ClickHouse (full indexation), or time series databases (I don't know them very well, but it seems they are increasingly being used).
If there aren't many of you to work on it, a full Kafka / HDFS / Spark / database architecture is not a good solution. If you need batch results (per hour or per day), maybe a simple HDFS / Spark setup plus one database could be enough.
The "product" data model is a little tricky I think because you need to update it every day. Maybe instead you could use an analytics databases (I know ES, but I heard that clickhouse was better) and register a "session" object, index all the barcodes scanned during each session and after make a count on the barcode needed. I don't know it if would be good for a long time perdiod, in the use case I work with there is a natural refresh every 6months.
PS: do you know the maximum number of barcodes you could have? Maybe they could all be stored in a Redis database; that way, a product model with upserting could be interesting.
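If all barcodes fit in memory, a Redis-based counter could be as simple as this sketch with redis-py (key names are made up):

```python
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def record_scan(barcode: str) -> None:
    # Hash acts as an upsert: the field is created on first increment.
    r.hincrby("scan_counts", barcode, 1)
    # Sorted set keeps products ranked by number of scans.
    r.zincrby("scan_rank", 1, barcode)

def top_products(n: int = 10):
    # Highest-scoring barcodes first, with their scan counts.
    return r.zrevrange("scan_rank", 0, n - 1, withscores=True)
```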
PPS: whatever solution you choose, it's always good to save your raw data on a filesystem :)
PPPS: I don't know MongoDB, but it seems like a good solution for the product model; since it's schemaless, the upserting should be much less painful than in Cassandra.
There are at least two things to consider:
Most of the time, the "best" choice for #1 is modified by #2 ;)
For edit tracking, OpenStreetMap has a LOT of tracking... each object in the DB has a full edit history stored. There are currently more than 5 billion objects in the DB.
Which DB is used for that? PostgreSQL.
It may not be the perfect choice, but it has so many features (want some NoSQL? It handles JSON natively. Want to handle geo data? There's PostGIS, etc.) that it is a bit universal. PG has a large community and a lot of people able to administer/tune it properly.
I'm using PG a lot (and used MySQL a lot in the past), and I used it for my OpenEventDatabase project, which is time- and space-related.
Do not put the "time" stuff too high on the shopping list. It is one of the things we currently have in mind, but storing activity logs and edit logs may bring new ideas that are less time-related... that's why a more universal tool seems like a good option to me.
In the DB world, size matters... @stephanegigandet gave some numbers: 1M events a day. We need to determine the level of detail we want to keep. Do we need to store full detailed events or just statistics about them? A mixed solution is possible... store recent events in detail and consolidate them into statistics on a regular basis.
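A sketch of what that consolidation could look like in Postgres (table and column names are invented for illustration):

```python
import psycopg2

# Roll up yesterday's detailed events into a daily statistics table, then
# prune old detail rows. All names here are hypothetical.
CONSOLIDATE = """
INSERT INTO scan_stats_daily (day, barcode, scan_count)
SELECT event_time::date, barcode, count(*)
FROM events
WHERE event_type = 'scan'
  AND event_time >= current_date - interval '1 day'
  AND event_time <  current_date
GROUP BY 1, 2
ON CONFLICT (day, barcode) DO UPDATE SET scan_count = EXCLUDED.scan_count;
"""

PRUNE = "DELETE FROM events WHERE event_time < current_date - interval '90 days';"

with psycopg2.connect("dbname=off") as conn, conn.cursor() as cur:
    cur.execute(CONSOLIDATE)
    cur.execute(PRUNE)
```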
FYI, the largest PG DB I've set up contained 1.5 billion lines of CSV (health data), and I'm also maintaining several OSM databases of around 1 TB.
After being pointed to this thread on Slack, I'll leave some comments here: I strongly agree with @cquest in terms of choosing "the right weapon": sticking to the keep-it-simple-and-stupid principle and not piling up stacks of software and protocols that can hardly be maintained unless absolutely necessary is a reasonable path to follow.
Aside from this general note, I'd also opt for evaluating Postgres as an alternative. It can do what an RDBMS can do, has a proper spatial extension (cf. OpenStreetMap) which can be handy for https://github.com/openfoodfacts/openfoodfacts-server/issues/1668, and can also do NoSQL, although I don't have any personal experience with the latter. The database could also be used for different purposes like user management (I've noticed that the admins of OFF are configured inside a configuration file) or whatever. An interesting question would be how the performance compares (and how to get existing data there in the first place).
In terms of @stephanegigandet's idea of using the nginx logs to gather some insights - this sounds like a typical application of logfile analysis using either a complete ELK stack or Graylog (with Elasticsearch as the backend). Having some experience with Graylog, the first idea coming to my mind would be: feed the nginx logs into Graylog (assuming they contain an API call like https://world.openfoodfacts.org/api/v0/product/737628064502.json), then use a grok pattern inside an input pipeline to strip out the relevant information like the product code. After that, there's the concept of lookup tables (https://docs.graylog.org/en/3.1/pages/lookuptables.html) that can be used to retrieve further information on the product in question (currently I've only worked with CSV adapters, but according to the docs, database or HTTP queries are also possible).
Finally, with the enriched logs, dashboards for different purposes can be created inside Graylog or, if you need fancier graphs and tables, you could attach Grafana to Graylog's Elasticsearch backend and use it for any kind of analysis (top products, where the users are from, ...).
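To make the extraction step concrete, here is roughly what the grok pattern would capture, sketched as a plain Python regex over an nginx access-log line (the log line and API path below are examples):

```python
import re

# Example access-log line containing a product API call.
LINE = ('203.0.113.7 - - [05/Sep/2019:08:23:45 +0000] '
        '"GET /api/v0/product/737628064502.json HTTP/1.1" 200 5123 "-" "OFF app"')

# Capture the product code from /api/vX/product/<barcode>.json requests.
BARCODE_RE = re.compile(r'"(?:GET|POST) /api/v\d+/product/(?P<barcode>\d+)\.json')

match = BARCODE_RE.search(LINE)
if match:
    print(match.group("barcode"))  # 737628064502
```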
In addition, you can use this infrastructure to monitor and debug the logfiles of all the other services on all the other servers you're maintaining as well.
So if the number of scans of a product relates to nginx calls, there's no number to store, as it can be derived from logfile analysis. So this is an "indirect" store. Depending on how much of the logfile data you're actually keeping, you can also store it for a long time period. You can even have different buckets with different lifespans for different purposes: e.g. your actual server logs (syslog, auth.log, mail.log, whatever.log) for a period of 30-90 days and only the OFF stats log for 2 years (depending on available disk sizes and the amount of log data).
Although the above mentioned GDPR compliance should not be forgotten ;)
Thanks for all the insightful comments @rajo! The basic scan data can be derived from the nginx logs, but we would like to add more data that is not in the nginx logs right now (e.g. who made the scan, so we can use that to build a scan history feature). And we would like to extend it to other product actions like creating or editing a product. So I would very much like to have the data in a database that can be easily and quickly queried. The dashboards from Graylog sound interesting, but what we really want is to reuse the scan data etc. in real time to build features inside the Open Food Facts web site and app.
For populating the DB, we could connect to the database and do inserts when there is an action, or we could populate a special log file.
So we are talking about two related but different use cases here?
Case 1: having a user profile with history and state across devices/webpages. Whenever I'm logged in to the OFF app on my smartphone, any product scan, add, or edit will be coupled with my user profile, so that when I switch devices (tablet, webpage on PC, ...) I can take my history with me.
This will then allow for:
In this case, maybe have something like a "history" table with a structure like:
| user_id | product_id | timestamp | action | source |
|---|---|---|---|---|
| 4077 | 474839473464 | 2019-09-05 08:23:45.023 | scan | user |
| 4077 | 474846746733 | 2019-09-05 08:25:50.002 | add | user |
| 4711 | 474846746733 | 2019-09-05 09:45:23.001 | search | api |
It would need some suitable indices to be queried fast. This history table could also be used for authenticated API calls.
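For instance, the indices could look like this (the DDL is a sketch; names and types follow the table above and are not a settled design):

```python
import psycopg2

# Hypothetical DDL for the "history" table sketched above, plus indices.
DDL = """
CREATE TABLE IF NOT EXISTS history (
    user_id    bigint      NOT NULL,
    product_id text        NOT NULL,
    ts         timestamptz NOT NULL,
    action     text        NOT NULL,  -- scan / add / edit / search
    source     text        NOT NULL   -- user / api
);
-- per-user history, newest first (cross-device history, authenticated API calls)
CREATE INDEX IF NOT EXISTS history_user_ts_idx ON history (user_id, ts DESC);
-- per-product queries (popularity, life and death of products)
CREATE INDEX IF NOT EXISTS history_product_ts_idx ON history (product_id, ts);
"""

with psycopg2.connect("dbname=off") as conn, conn.cursor() as cur:
    cur.execute(DDL)
```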
Case 2: unauthenticated searches on the webpage or unauthenticated API calls (if they exist, I didn't check). This stuff will mainly end up in the nginx logs, unless you have some sort of dummy user to which all those requests are accounted. But I'm not sure that makes much sense. So for those, I'd still think some logfile analysis would be useful for gathering statistical data. This would then represent the upper bound of all requests made against OFF - authenticated or not. Case 1 would be a subset of it.
So, is there a comprehensive list of use cases / features (perhaps weighted by importance) that should be implemented / solved with the data? This might help in investigating what data is actually needed, where it can originate from, and what a suitable way of storing and querying it would be.
Hi @rajo,
The use case we are really interested in is Case 1: real users' scans and actions. Users can be logged-in users or anonymous users (but we often have a kind of UUID).
We actually do not want to record other types of API calls that do not correspond to real user actions (e.g. bots that crawl the API, or anything that makes automated API calls). Those records have no value for determining the most scanned / viewed etc. products.
The log analysis can be useful for the performance of the API (to see the spikes, to find abusive robots, etc.), but for that use, we mostly care about the raw number of API calls and the IP address.
Case 1 is about making new features possible, which is the object of this feature request. Case 2 is for monitoring, which is a much lower priority at this point.
We are going to implement an event repository and API for scans, photo uploads, etc. using Postgres and FastAPI, separately from the Product Opener code base, in this new repo: https://github.com/openfoodfacts/openfoodfacts-events
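For readers curious what that could look like, here is a minimal FastAPI sketch of an event-ingestion endpoint (the path and payload fields are hypothetical, not the actual openfoodfacts-events API):

```python
from datetime import datetime
from typing import Optional

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Event(BaseModel):
    # Hypothetical payload; the real openfoodfacts-events schema may differ.
    event_type: str                   # "scan", "photo_upload", "edit", ...
    barcode: str
    user_id: Optional[str] = None     # anonymized id for anonymous users
    timestamp: Optional[datetime] = None

@app.post("/events")
async def create_event(event: Event):
    # A real implementation would insert the event into Postgres here.
    return {"status": "created", "event": event}
```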