medic / cht-core

The CHT Core Framework makes it faster to build responsive, offline-first digital health apps that equip health workers to provide better care in their communities. It is a central resource of the Community Health Toolkit.
https://communityhealthtoolkit.org
GNU Affero General Public License v3.0
468 stars 217 forks source link

Telemetry date not logged on telemetry documents #8837

Closed 1yuv closed 9 months ago

1yuv commented 9 months ago

Describe the bug It has been observed that for many users, a telemetry document is captured without a proper date.

Look at this example telemetry doc:

 "metadata": {
    "year": null,
    "month": null,
    "day": null,
    "user": "_sample_",
    "deviceId": "d496ea42-b24f-4864-8f99-0aa7bcda5179",
    "versions": {
      "app": "4.5.0",
      "forms": {
        "contact:clinic:create": "8-cf7d43ccfbaf887faf1b6550ecde6d8f",

To Reproduce There's no particular way to reproduce this, but these documents can be seen in many 4.5.0 production instances. I'll share the link to these documents in slack.

Expected behavior Proper date should be logged for year, month, day field.

Environment

Additional context This might have been caused after this fix. Issues related to missing days in telemetry document was fixed in that issue.

This issue has adverse effect in data reporting since we sync telemetry documents by couch2pg and since couch2pg view doesn't proper date, the couch2pg fails causing entire data transfer failing for all instances running 4.5.0.

latin-panda commented 9 months ago

I have been trying to reproduce this issue and see if there is any bad telemetry db running 4.2.x to 4.5.x branches. Like removing cookies (telemetry has the user name), clearing localStorage, offline user, online user, etc. But so far, no luck. @ralfudx can you please try and see if you can reproduce it?

I've added a stronger check to only process telemetry db with correct names and throw an exception if the dates are NaN, which can help us track better next time.

garethbowen commented 9 months ago

I've added a stronger check to only process telemetry db with correct names and throw an exception if the dates are NaN, which can help us track better next time.

I don't think this will work, because the deleteDeprecatedTelemetryDB function is called before the getDbDate function, so the new Error code will never run. I think the unit test proves this because of this line: expect(consoleErrorSpy.notCalled).to.be.true;

Right?

latin-panda commented 9 months ago

I don't think this will work, because the deleteDeprecatedTelemetryDB function is called before the getDbDate function, so the new Error code will never run.

I think you mean getTelemetryDBs runs before getDbDate. Then, the exception won't trigger now that we have the new checks. Yes, I considered not checking for the NaN because it is harder to reach. But wanted to be extra defensive, and it's not expensive.

We can throw another exception if it has telemetry in the name but not the old telemetry, which means it's malformed.

garethbowen commented 9 months ago

You've added an Error to gather more info, but the Error will never be thrown (as proven by your unit tests), so we won't gather more info. This statement is incorrect: "... which can help us track better next time."

Regardless we need to spend more time reproducing it.

Have you looked in depth at the telemetry docs that are failing? Do they have anything in common? Could it be because they're in Nepal? Can we prove they're 4.2.0 based telemetry dbs, or are the errors occurring on dbs generated by 4.5.0? Is it the same users, or are the docs spread over all users? Is it still occurring or now that everyone has upgraded has the number of docs stopped increasing? Have you analyzed the logs to determine when these docs were synced? If there's not enough info in couchdb you can probably find something in postgres as these docs have been synced to pg as they came in. @1yuv may be able to help here.

1yuv commented 9 months ago

Hi @latin-panda , Happy to help with data if you need more help. To answer some of the questions:

Is it the same users,

It's spread across all users.

Luckily, we still have one 4.2.0 instance and we can upgrade to 4.5.0 if this helps in debugging or pinpointing to what's causing it exactly. There's no issue in this 4.2 instance. I am also happy to provide any details you need of these users of postgres regarding pattern and timeliness.

Can we prove they're 4.2.0 based telemetry dbs

They probably are not as I can see None telemetry generated months after the upgrade.

You might be wondering how can I see date for telemetry documnts where there's none, none logged? There's a time at the end of the telemetry id for documents which are conflicted. Le me know if I can provide any additional insight regarding these documents.

latin-panda commented 9 months ago

Before, we were using momentjs like this

year: date.year(),
month: date.month() + 1,
day: date.date(),

Then I changed it to moment.format(...), which translates the numbers to the language (२०२४-०२-०९) and generates the error. Those two instances are still generating correct telemetry data; they are probably test users using English because their telemetry data shows 2 to 3 different device IDs and browsers with the same user. Binod suggested they are debugging this other issue. For example:

I have removed momentjs and just use the regular Date from JavaScript

latin-panda commented 9 months ago

This will be released in 4.5.2

michaelkohn commented 9 months ago

Does this fix include anything to prevent couch2pg from breaking or recommendations on how to fix the data (so couch2pg doesn't break)?

garethbowen commented 9 months ago

@michaelkohn It doesn't fix data already synced to CouchDB, but anything not yet aggregated will either be done right or dropped. There's a related issue in couch2pg to handle this case: https://github.com/medic/cht-couch2pg/issues/144

garethbowen commented 9 months ago

Additional information is that this only impacts users with languages that use non-ASCII characters for numbers, in this case Nepali.

Swahili, English, Spanish, and French are tested and working unaffected.