mozilla / ichnaea

Mozilla Ichnaea
http://location.services.mozilla.com
Apache License 2.0

Data loss when uploading measurements but API responds OK #2043

Closed zamojski closed 4 months ago

zamojski commented 1 year ago

I'm the developer of the Tower Collector app, which contributes to MLS. One of our users from Japan recently reported a data loss issue. Below is a quote from the email.

There has been a problem with the Mozilla Location Service since about July 2022. The problem is that the Mozilla Location Service sometimes ignores the data collected by Tower Collector when it is uploaded.

When this problem occurs, the geosubmit API of the Mozilla Location Service is still responding with HTTP 200 OK and "{}" data, so the upload appears to have completed successfully in Tower Collector. We are confident this is not a Tower Collector issue.

In some cases, the Mozilla Location Service now behaves in such a way that, even if the upload succeeds, the data parsing times out and the data uploaded by the user is discarded. (I haven't checked properly, but there is about a 70% chance that the data is discarded.)

...

If the data does not appear in the Mozilla Location Service after a long wait (about one day) after uploading, we assume that the data has been discarded.

@jwhitlock, or someone well informed, could you take a look at this? It seems to be a significant issue.

jwhitlock commented 1 year ago

Thanks for the report @zamojski, and I'm sorry the service is not working as expected. I'm no longer on the team supporting this codebase or MLS, but I can give some context.

Looking at the submission code:

https://github.com/mozilla/ichnaea/blob/47d34a57a3b1b39728a2fe064e81690d2dcdbdb0/ichnaea/api/submit/views.py#L41-L75

There are a few things that can go wrong. If the data is omitted, a 400 is returned. If our caching server is down, a 502 is returned. If the API key is wrong, a 200 OK is still returned. They should double-check the API key, for example by checking whether it is valid on /v1/geolocate.
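
A minimal sketch of how a client could read these status codes, based only on the behaviour described in this comment. The function name and the returned labels are made up for illustration and are not part of ichnaea or Tower Collector.

```python
def interpret_submit_status(status_code: int) -> str:
    """Map a geosubmit HTTP status to what the client can conclude from it.

    The mapping follows the description in this thread; the labels are
    hypothetical names chosen for this example.
    """
    if status_code == 200:
        # 200 only means the request was accepted for queuing. It is also
        # returned for a wrong API key, so it does not confirm the data
        # will ever be processed.
        return "accepted-not-confirmed"
    if status_code == 400:
        # The submitted data was omitted.
        return "rejected-missing-data"
    if status_code == 502:
        # The caching server behind the API is down; retrying later is reasonable.
        return "retry-later"
    return "unknown"


if __name__ == "__main__":
    for code in (200, 400, 502):
        print(code, interpret_submit_status(code))
```

The point of the sketch is that a 200 cannot be treated as proof of storage, which is why the key should be validated separately.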

The fact that some are getting through could mean a processing issue. The submission is queued and processed on the backend, not during the request. The processing time is uncertain, but I'd expect within 24 hours. If the system gets overloaded, it does drop some observations. Submissions are much less likely to be dropped than other observations.

zamojski commented 1 year ago

In the described case, a 200 OK is returned. I'm pretty sure the API key is fine because the application uses a single key that is compiled directly into the code. I've also checked it against the /v1/country endpoint and it is valid.

@jwhitlock is there anyone left on the MLS team who could support us in troubleshooting?

isogame3 commented 1 year ago

Hello. I am the user who consulted @zamojski about the data loss issue at the beginning of this thread.

Recently, I conducted an experiment. I collected 24 eNB-LCID measurements from cell towers in areas that few people visit. I decided to send one measurement to the Mozilla Location Service geosubmit v2 API every hour and see how many hours later it appears in the hourly Mozilla Location Service Differential Cell Exports data. (If someone other than me sent readings for these cell towers to the Mozilla Location Service, the experiment would be meaningless, so I chose an area that people rarely visit.)

Send eNB-LCID "X" at 00:00, eNB-LCID "Y" at 01:00, eNB-LCID "Z" at 02:00, and so on. A different cell tower measurement is sent every hour until 23:00.

This way, if the eNB-LCID "X" data sent at 00:00 appears in the Differential Cell Exports data at 05:00, we can see that the background queue holds about 5 hours of work and that it is being processed.

Additionally, if the data for eNB-LCID "Y" then appears at 07:00 instead of 06:00, the measurement data that contributors submitted to the geosubmit v2 API between 00:00 and 01:00 took MLS about 2 hours to process, and we can assume that the background queue is growing.

Finally, if the data for eNB-LCID "Z" does not appear in the Differential Cell Exports data even after 24 hours, we assume that the measurement sent to MLS at 02:00 was lost.
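
The rules of the experiment can be sketched in a few lines of Python. This is only an illustration of the reasoning above: the function names, the `pending` state, and the 24-hour constant are assumptions made for the example, not something MLS provides.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

FMT = "%Y-%m-%d %H:%M:%S"
LOSS_THRESHOLD = timedelta(hours=24)  # "about one day" without appearing counts as lost


def classify(submitted: str, updated: Optional[str], now: str) -> Tuple[str, Optional[timedelta]]:
    """Return (status, delay) for one submission.

    `updated` is the time the cell showed up in the differential export,
    or None if it has not appeared yet.
    """
    sent = datetime.strptime(submitted, FMT)
    if updated is None:
        waited = datetime.strptime(now, FMT) - sent
        return ("discarded" if waited >= LOSS_THRESHOLD else "pending", None)
    return ("processed", datetime.strptime(updated, FMT) - sent)


def backlog_growing(delays: List[timedelta]) -> bool:
    """True if every consecutive hourly submission waited longer than the one before."""
    return len(delays) >= 2 and all(b > a for a, b in zip(delays, delays[1:]))


if __name__ == "__main__":
    now = "2023-01-02 12:00:00"  # arbitrary example dates
    _, x = classify("2023-01-01 00:00:00", "2023-01-01 05:00:00", now)
    _, y = classify("2023-01-01 01:00:00", "2023-01-01 07:00:00", now)
    print(x, y, backlog_growing([x, y]))
    print(classify("2023-01-01 02:00:00", None, now))
```

With the "X" and "Y" example from the text, the delays are 5 and 6 hours, so the backlog counts as growing, and "Z" is classified as discarded once 24 hours have passed.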

Most recently, it looked like this:

| Submitted to MLS | MLS "updated" column time | Time difference | Status |
| --- | --- | --- | --- |
| 2023-10-22 14:15:00 | 2023-10-22 19:33:27 | 5:18 | Delayed |
| 2023-10-22 13:15:00 | 2023-10-22 18:03:17 | 4:48 | Delayed |
| 2023-10-22 12:15:00 | 2023-10-22 16:37:52 | 4:22 | Delayed |
| 2023-10-22 11:15:00 | 2023-10-22 15:11:46 | 3:56 | Delayed |
| 2023-10-22 10:15:00 | 2023-10-22 14:01:43 | 3:46 | Delayed |
| 2023-10-22 09:15:00 | 2023-10-22 12:35:20 | 3:20 | Delayed |
| 2023-10-22 08:15:00 | 2023-10-22 10:52:23 | 2:37 | Delayed |
| 2023-10-22 07:15:00 | 2023-10-22 09:38:17 | 2:23 | Delayed |
| 2023-10-22 06:15:00 | 2023-10-22 07:54:12 | 1:39 | Delayed |
| 2023-10-22 05:15:00 | 2023-10-22 06:09:22 | 0:54 | Delayed |
| 2023-10-22 04:15:00 | | | Discarded |
| 2023-10-22 03:15:00 | | | Discarded |
| 2023-10-22 02:15:00 | | | Discarded |
| 2023-10-22 01:15:00 | | | Discarded |
| 2023-10-22 00:15:00 | | | Discarded |
| 2023-10-21 23:15:00 | 2023-10-22 03:51:57 | 4:36 | Delayed |
| 2023-10-21 22:15:00 | 2023-10-22 01:40:01 | 3:25 | Delayed |
| 2023-10-21 21:15:00 | 2023-10-21 21:57:43 | 0:42 | Delayed |
| 2023-10-21 20:15:00 | | | Discarded |
| 2023-10-21 19:15:00 | | | Discarded |
| 2023-10-21 18:15:00 | | | Discarded |
| 2023-10-21 17:15:00 | | | Discarded |
| 2023-10-21 16:15:00 | | | Discarded |
| 2023-10-21 15:15:00 | 2023-10-21 20:30:20 | 5:15 | Delayed |
| 2023-10-21 14:15:00 | 2023-10-21 18:31:24 | 4:16 | Delayed |
| 2023-10-21 13:15:00 | 2023-10-21 16:05:55 | 2:50 | Delayed |
| 2023-10-21 12:15:00 | 2023-10-21 14:03:42 | 1:48 | Delayed |
| 2023-10-21 11:15:00 | 2023-10-21 12:01:34 | 0:46 | Delayed |
| 2023-10-21 10:15:00 | | | Discarded |
| 2023-10-21 09:15:00 | | | Discarded |
| 2023-10-21 08:15:00 | | | Discarded |
| 2023-10-21 07:15:00 | | | Discarded |
| 2023-10-21 06:15:00 | | | Discarded |
| 2023-10-21 05:15:00 | | | Discarded |
| 2023-10-21 04:15:00 | | | Discarded |
| 2023-10-21 03:15:00 | 2023-10-21 08:01:32 | 4:46 | Delayed |
| 2023-10-21 02:15:00 | 2023-10-21 05:35:03 | 3:20 | Delayed |
| 2023-10-21 01:15:00 | 2023-10-21 02:49:54 | 1:34 | Delayed |
| 2023-10-21 00:15:00 | | | Discarded |
| 2023-10-20 23:15:00 | | | Discarded |
| 2023-10-20 22:15:00 | | | Discarded |
| 2023-10-20 21:15:00 | | | Discarded |
| 2023-10-20 20:15:00 | 2023-10-20 22:39:53 | 2:24 | Delayed |
| 2023-10-20 19:15:00 | 2023-10-20 19:16:19 | | No Delay |
| 2023-10-20 18:15:00 | | | Discarded |
| 2023-10-20 17:15:00 | | | Discarded |
| 2023-10-20 16:15:00 | | | Discarded |
| 2023-10-20 15:15:00 | | | Discarded |
| 2023-10-20 14:15:00 | 2023-10-20 17:43:48 | 3:28 | Delayed |
| 2023-10-20 13:15:00 | 2023-10-20 15:38:10 | 2:23 | Delayed |
| 2023-10-20 12:15:00 | 2023-10-20 13:32:25 | 1:17 | Delayed |
| 2023-10-20 11:15:00 | 2023-10-20 11:39:38 | 0:24 | Delayed |
| 2023-10-20 10:15:00 | | | Discarded |
| 2023-10-20 09:15:00 | | | Discarded |
| 2023-10-20 08:15:00 | | | Discarded |
| 2023-10-20 07:15:00 | | | Discarded |
| 2023-10-20 06:15:00 | | | Discarded |
| 2023-10-20 05:15:00 | 2023-10-20 10:46:50 | 5:31 | Delayed |
| 2023-10-20 04:15:00 | 2023-10-20 08:32:27 | 4:17 | Delayed |
| 2023-10-20 03:15:00 | 2023-10-20 05:58:49 | 2:43 | Delayed |
| 2023-10-20 02:15:00 | 2023-10-20 03:03:55 | 0:48 | Delayed |
| 2023-10-20 01:15:00 | | | Discarded |
| 2023-10-20 00:15:00 | | | Discarded |
| 2023-10-19 23:15:00 | | | Discarded |
| 2023-10-19 22:15:00 | 2023-10-20 01:45:04 | 3:30 | Delayed |
| 2023-10-19 21:15:00 | 2023-10-20 00:32:00 | 3:17 | Delayed |
| 2023-10-19 20:15:00 | 2023-10-19 20:54:02 | 0:39 | Delayed |
| 2023-10-19 19:15:00 | | | Discarded |
| 2023-10-19 18:15:00 | | | Discarded |
| 2023-10-19 17:15:00 | | | Discarded |
| 2023-10-19 16:15:00 | | | Discarded |
| 2023-10-19 15:15:00 | | | Discarded |
| 2023-10-19 14:15:00 | 2023-10-19 17:05:03 | 2:50 | Delayed |
| 2023-10-19 13:15:00 | 2023-10-19 14:45:44 | 1:30 | Delayed |
| 2023-10-19 12:15:00 | 2023-10-19 12:37:50 | 0:22 | Delayed |
| 2023-10-19 11:15:00 | | | Discarded |
| 2023-10-19 10:15:00 | | | Discarded |
| 2023-10-19 09:15:00 | | | Discarded |
| 2023-10-19 08:15:00 | | | Discarded |
| 2023-10-19 07:15:00 | | | Discarded |
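
As a quick tally of the rows above, the outcomes can be counted from the consecutive runs in the table (newest first). The run lengths below are transcribed by hand from the table, so treat them as a reading aid rather than an authoritative dataset.

```python
from collections import Counter

# Consecutive runs of outcomes, newest first, transcribed from the table above.
RUNS = [
    ("delayed", 10), ("discarded", 5),
    ("delayed", 3), ("discarded", 5),
    ("delayed", 5), ("discarded", 7),
    ("delayed", 3), ("discarded", 4),
    ("delayed", 1), ("no delay", 1), ("discarded", 4),
    ("delayed", 4), ("discarded", 5),
    ("delayed", 4), ("discarded", 3),
    ("delayed", 3), ("discarded", 5),
    ("delayed", 3), ("discarded", 5),
]

counts = Counter()
for status, run_length in RUNS:
    counts[status] += run_length

total = sum(counts.values())
discard_rate = counts["discarded"] / total

if __name__ == "__main__":
    print(dict(counts), total, f"{discard_rate:.1%}")
```

Over these 80 hourly submissions, 43 were discarded, 36 arrived with a delay, and 1 arrived without delay, which is a discard rate of roughly 54% for this sample.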

From the above experimental results, it seems that, as @jwhitlock commented, the measurement data sent to the geosubmit v2 API is processed in a background queue. In most cases, once the background queue has grown to about 4 hours' worth of processing, there appears to be a period with no processing, after which the entire unprocessed backlog is deleted.

Immediately after the deletion, measurement data submitted to the geosubmit v2 API from various applications, including Tower Collector, is processed within minutes to tens of minutes, and the service works extremely well. About an hour or two after the deletion took place, the background queue started to build up over about two hours, and within a few hours it was back in the state shown above.

Therefore, contributors can successfully contribute to the Mozilla Location Service if they send measurement data immediately after one of the randomly occurring background queue deletions. On the other hand, data sent while the backlog is large will be discarded and wasted. Contributors have no way to check how large the backlog is, and I think only a few people contribute with these circumstances in mind.

As mentioned in the quote from the first email, we know that the Mozilla Location Service has a very high probability of deleting measurement data without processing it. To work around this, we use measurement upload applications such as Tower Collector to back up measurements before sending them. After sending, we check the hourly Mozilla Location Service Differential Cell Exports data, and if the Cell ID of a new cell tower has not appeared after about a day, we resend the backup data.

I would like this problem to be resolved if possible. Thank you for reading.
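
The resend decision in this workaround could look roughly like the sketch below. The data structures (a mapping from cell ID to send time, and a set of cell IDs seen in the exports) and the function name are assumptions for illustration; downloading and parsing the export files is left out.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Set


def cells_to_resend(
    backups: Dict[str, datetime],
    exported_cells: Set[str],
    now: datetime,
    wait: timedelta = timedelta(days=1),
) -> List[str]:
    """Return backed-up cell IDs that should be uploaded again.

    backups        -- hypothetical map of cell ID -> time the measurement was sent
    exported_cells -- cell IDs already seen in the differential cell exports
    wait           -- how long to wait before assuming the data was discarded
    """
    return sorted(
        cell_id
        for cell_id, sent_at in backups.items()
        if cell_id not in exported_cells and now - sent_at >= wait
    )


if __name__ == "__main__":
    now = datetime(2023, 10, 23, 12, 0)
    backups = {
        "cell-A": datetime(2023, 10, 22, 4, 15),   # older than a day, never exported
        "cell-B": datetime(2023, 10, 22, 5, 15),   # already exported
        "cell-C": datetime(2023, 10, 23, 9, 15),   # too recent to judge
    }
    print(cells_to_resend(backups, {"cell-B"}, now))
```

Only measurements that are both older than the waiting period and missing from the exports are resent, which avoids duplicating uploads that are merely delayed.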

webbdays commented 1 year ago

https://ichnaea.readthedocs.io/en/latest/rate_control.html#processing-backlogs-and-rate-control This might be helpful.

zamojski commented 1 year ago

So this seems to be a known issue / design flaw... But it has a huge impact on database completeness. Maybe some extra resources can be added? @jwhitlock is there any chance someone will work on it?

mostlygeek commented 1 year ago

Hi, we're discussing this internally. No updates right now. You can @ me as MLS falls in my portfolio.

zamojski commented 11 months ago

@mostlygeek is there any progress on the discussion?

mostlygeek commented 11 months ago

Nothing to share yet. We are discussing this and very much appreciate the patience.

zamojski commented 10 months ago

@mostlygeek Have you reached any conclusions from the discussion? Is there a chance someone will work on improvements?

alexcottner commented 10 months ago

I found that we had a 1-second delay between processing batches of data. Some simple local testing (just flooding the API with traffic from JMeter) showed that this causes the data queues to back up pretty quickly. I've submitted a PR to remove the delay and will monitor once it is deployed to see if things have improved.
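
To illustrate why a fixed pause between batches caps throughput, here is a back-of-the-envelope model. The batch size, per-batch work time, and arrival rate are invented numbers for the example, not measurements from MLS, and the model ignores everything except the pause.

```python
def max_throughput(batch_size: int, work_seconds: float, delay_seconds: float) -> float:
    """Items per second a single worker can drain when it pauses after every batch."""
    return batch_size / (work_seconds + delay_seconds)


def backlog_after(hours: float, arrivals_per_second: float, batch_size: int,
                  work_seconds: float, delay_seconds: float) -> float:
    """Queued items after `hours` of steady traffic, starting from an empty queue."""
    surplus = arrivals_per_second - max_throughput(batch_size, work_seconds, delay_seconds)
    return max(0.0, surplus) * hours * 3600


if __name__ == "__main__":
    # Hypothetical workload: batches of 100 items, 0.25 s of work per batch,
    # 150 items arriving per second.
    print(max_throughput(100, 0.25, 1.0))          # with the 1-second pause
    print(max_throughput(100, 0.25, 0.0))          # pause removed
    print(backlog_after(1, 150, 100, 0.25, 1.0))   # backlog after one hour with the pause
    print(backlog_after(1, 150, 100, 0.25, 0.0))   # backlog after one hour without it
```

In this made-up scenario the pause cuts the drain rate from 400 to 80 items per second, so a 150 items per second load that would otherwise be absorbed leaves 252,000 items queued after one hour.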

ETA: Targeting a release early next week.

webbdays commented 10 months ago

Another solution might be to give API users a way to learn the status of the backlog or the current traffic, either through a header on API responses or through a separate endpoint, so that they can decide when and how to use the API. This would minimise data loss and save time spent waiting.
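
A client-side sketch of this idea, assuming the server exposed the backlog in a response header. The header name `X-Backlog-Seconds` is purely hypothetical: MLS does not send such a header today, and the threshold is an arbitrary example value.

```python
from typing import Mapping, Optional

# Hypothetical header name; nothing like this exists in the current API.
BACKLOG_HEADER = "X-Backlog-Seconds"


def should_upload_now(headers: Mapping[str, str], max_backlog_seconds: int = 3600) -> Optional[bool]:
    """Decide whether to upload, based on a (hypothetical) backlog header.

    Returns None when the header is missing or malformed, so the caller can
    fall back to its normal behaviour. The lookup is case-sensitive here;
    a real HTTP client usually provides case-insensitive header access.
    """
    raw = headers.get(BACKLOG_HEADER)
    if raw is None:
        return None
    try:
        backlog_seconds = int(raw)
    except ValueError:
        return None
    return backlog_seconds <= max_backlog_seconds


if __name__ == "__main__":
    print(should_upload_now({"X-Backlog-Seconds": "600"}))
    print(should_upload_now({"X-Backlog-Seconds": "18000"}))
    print(should_upload_now({}))
```

An uploader could use such a check to hold measurements locally while the backlog is large and send them once it has drained.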

mjaakko commented 10 months ago

Is the merged fix already in use by MLS? It seems that data submissions are still quite unreliable.

mostlygeek commented 10 months ago

We have deployed a fix that has led to some improvements, but not to the extent anticipated. Further investigation revealed that a substantial architectural overhaul is necessary to ensure no data loss and faster processing. Currently, MLS lacks the necessary engineering resources for such an extensive change so we won't be able to address this at this time.