Closed 1yuv closed 9 months ago
I have been trying to reproduce this issue and see if there is any bad telemetry db running 4.2.x to 4.5.x branches. Like removing cookies (telemetry has the user name), clearing localStorage, offline user, online user, etc. But so far, no luck. @ralfudx can you please try and see if you can reproduce it?
I've added a stronger check to only process telemetry db with correct names and throw an exception if the dates are NaN, which can help us track better next time.
I've added a stronger check to only process telemetry db with correct names and throw an exception if the dates are NaN, which can help us track better next time.
I don't think this will work, because the deleteDeprecatedTelemetryDB
function is called before the getDbDate
function, so the new Error code will never run. I think the unit test proves this because of this line: expect(consoleErrorSpy.notCalled).to.be.true;
Right?
I don't think this will work, because the deleteDeprecatedTelemetryDB function is called before the getDbDate function, so the new Error code will never run.
I think you mean getTelemetryDBs
runs before getDbDate
. Then, the exception won't trigger now that we have the new checks. Yes, I considered not checking for the NaN because it is harder to reach. But wanted to be extra defensive, and it's not expensive.
We can throw another exception if it has telemetry
in the name but not the old telemetry, which means it's malformed.
You've added an Error to gather more info, but the Error will never be thrown (as proven by your unit tests), so we won't gather more info. This statement is incorrect: "... which can help us track better next time."
Regardless we need to spend more time reproducing it.
Have you looked in depth at the telemetry docs that are failing? Do they have anything in common? Could it be because they're in Nepal? Can we prove they're 4.2.0 based telemetry dbs, or are the errors occurring on dbs generated by 4.5.0? Is it the same users, or are the docs spread over all users? Is it still occurring or now that everyone has upgraded has the number of docs stopped increasing? Have you analyzed the logs to determine when these docs were synced? If there's not enough info in couchdb you can probably find something in postgres as these docs have been synced to pg as they came in. @1yuv may be able to help here.
Hi @latin-panda , Happy to help with data if you need more help. To answer some of the questions:
Is it the same users,
It's spread across all users.
Luckily, we still have one 4.2.0 instance and we can upgrade to 4.5.0 if this helps in debugging or pinpointing to what's causing it exactly. There's no issue in this 4.2 instance. I am also happy to provide any details you need of these users of postgres regarding pattern and timeliness.
Can we prove they're 4.2.0 based telemetry dbs
They probably are not as I can see None
telemetry generated months after the upgrade.
You might be wondering how can I see date for telemetry documnts where there's none
, none
logged?
There's a time
at the end of the telemetry id for documents which are conflicted. Le me know if I can provide any additional insight regarding these documents.
Before, we were using momentjs like this
year: date.year(),
month: date.month() + 1,
day: date.date(),
Then I changed it to moment.format(...)
, which translates the numbers to the language (२०२४-०२-०९) and generates the error.
Those two instances are still generating correct telemetry data; they are probably test users using English because their telemetry data shows 2 to 3 different device IDs and browsers with the same user. Binod suggested they are debugging this other issue.
For example:
I have removed momentjs and just use the regular Date from JavaScript
This will be released in 4.5.2
Does this fix include anything to prevent couch2pg from breaking or recommendations on how to fix the data (so couch2pg doesn't break)?
@michaelkohn It doesn't fix data already synced to CouchDB, but anything not yet aggregated will either be done right or dropped. There's a related issue in couch2pg to handle this case: https://github.com/medic/cht-couch2pg/issues/144
Additional information is that this only impacts users with languages that use non-ASCII characters for numbers, in this case Nepali.
Swahili, English, Spanish, and French are tested and working unaffected.
Describe the bug It has been observed that for many users, a telemetry document is captured without a proper date.
Look at this example telemetry doc:
To Reproduce There's no particular way to reproduce this, but these documents can be seen in many 4.5.0 production instances. I'll share the link to these documents in slack.
Expected behavior Proper date should be logged for year, month, day field.
Environment
Additional context This might have been caused after this fix. Issues related to missing days in telemetry document was fixed in that issue.
This issue has adverse effect in data reporting since we sync telemetry documents by couch2pg and since couch2pg view doesn't proper date, the couch2pg fails causing entire data transfer failing for all instances running 4.5.0.