tmo1 / sms-ie

SMS Import / Export is a simple Android app that imports and exports SMS and MMS messages, call logs, and contacts from and to JSON / NDJSON files.
GNU General Public License v3.0

Importing contacts results in duplicates #52

Open tmo1 opened 2 years ago

tmo1 commented 2 years ago

The jq command returned 279 contacts, as expected. I tried importing them in an Android emulator, and 836 were reported as imported. All the contacts are imported twice: the first copy is correct (with image and all fields), while the second is completely empty, with only the name-title correct.

Originally posted by @thanasistrisp in https://github.com/tmo1/sms-ie/issues/50#issuecomment-1233276607

tmo1 commented 2 years ago

It's going to be difficult to figure this out without being able to reproduce the problem. Android is supposed to "aggregate" "matching" contacts, but I don't see a definition of "matching" or a precise specification for the "aggregation" procedure.

How many contacts are actually present after import? Assuming you import your 279 exported contacts into an empty contacts list (e.g., in a fresh emulator image) and then turn around and export them, how many are reported as exported?

thanasistrisp commented 2 years ago

Exporting again from your app reports that 567 were exported.

tmo1 commented 2 years ago

567 is more than twice 279, so it's not just a neat case of each contact appearing twice.

thanasistrisp commented 2 years ago

In the initial export, your app showed that 279 were exported. However, when importing, the app said 836, and exporting again said 567. 279 is the correct number that should have been imported...

thanasistrisp commented 2 years ago

> 567 is more than twice 279, so it's not just a neat case of each contact appearing twice.

From what I saw, most contacts exist twice, but as far as I can tell some may be shown three times.

tmo1 commented 2 years ago

I may have a solution for this, and I have started implementing it in code, but I can't really test or debug it without a contacts collection that displays the problem. Are you willing to post a redacted version of yours? You can do the following:

  1. Create a smaller collection that still has the problem, using the max-records / max_messages preference setting. (The latest commit changed its name from the latter to the former, and enabled it in non-debug builds.)
  2. Redact any information you consider private / personal / sensitive. The following command (where contacts-nnnn-nn-nn.json is the original file exported by the app, and contacts-redacted.json will be the redacted version) will remove much / most of such information:
    jq 'walk(if type=="object" then with_entries(if ((.key | startswith("display_name")) or (.key | startswith("sort_key")) or (.key | startswith("data")) or (.key == "account_name")) then .value |= "REDACTED" else . end) else . end)' contacts-nnnn-nn-nn.json > contacts-redacted.json

    You should still go through the redacted version to make sure there's nothing you don't want there, and I can take no responsibility for any sensitive information leaking through.

1Dragoon commented 3 months ago

Hey, I've noticed that I get several duplicates from this, and I think this should be a good enough sample. Oftentimes I get as many as four duplicates, and I think this is why:

 grep 'account_type' ./contacts-redacted.json | sort | uniq
"account_type": "com.google",
"account_type": "com.google.android.apps.tachyon",
"account_type": "com.whatsapp",
"account_type": "org.thoughtcrime.securesms",
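The same check can be done with counts per account type, which shows how many entries each sync source contributes. This is only a minimal sketch: it walks the JSON recursively and tallies every "account_type" key it finds, so it does not depend on the exact layout of the export. The sample data and its "raw_contacts" key are hypothetical and are not taken from a real export.

```python
import json
from collections import Counter

def count_account_types(node, counts=None):
    """Recursively tally every "account_type" string found anywhere in the JSON tree."""
    if counts is None:
        counts = Counter()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "account_type" and isinstance(value, str):
                counts[value] += 1
            else:
                count_account_types(value, counts)
    elif isinstance(node, list):
        for item in node:
            count_account_types(item, counts)
    return counts

# Hypothetical sample; the layout of a real export may differ.
sample = json.loads("""
[
  {"display_name": "A", "raw_contacts": [
      {"account_type": "com.google"},
      {"account_type": "com.whatsapp"}]},
  {"display_name": "B", "raw_contacts": [
      {"account_type": "com.google"}]}
]
""")

counts = count_account_types(sample)
for account_type, n in sorted(counts.items()):
    print(account_type, n)
```

For a real file, replace the sample with `json.load(open("contacts-redacted.json"))`.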

(removed)

My thought is it might be more useful to sha256sum+truncate each field instead of redacting it, but I think I'd need to write some actual code for that as I don't believe jq can do that.
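A rough sketch of that hash-and-truncate idea in Python, assuming the same key selection as the jq command above (keys starting with "display_name", "sort_key", or "data", plus "account_name"). The function names and the 8-character digest length are illustrative choices. Identical values map to identical tokens, so related contacts stay recognizable. Note that unsalted hashes of low-entropy values such as phone numbers can be brute-forced, so this protects less than full redaction does.

```python
import hashlib
import json

SENSITIVE_PREFIXES = ("display_name", "sort_key", "data")

def is_sensitive(key):
    """Same key selection as the jq redaction command."""
    return key == "account_name" or key.startswith(SENSITIVE_PREFIXES)

def hash_value(value, digest_len=8):
    """Replace a value with a truncated SHA-256 hex digest; None stays None."""
    if value is None:
        return None
    if not isinstance(value, str):
        # Non-string values are serialized first so they can be hashed too.
        value = json.dumps(value, sort_keys=True)
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:digest_len]

def pseudonymize(node, digest_len=8):
    """Walk the JSON tree and hash the values of sensitive keys."""
    if isinstance(node, dict):
        return {
            key: hash_value(value, digest_len) if is_sensitive(key)
            else pseudonymize(value, digest_len)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [pseudonymize(item, digest_len) for item in node]
    return node

# Hypothetical record, not taken from a real export.
record = {"display_name": "Alice", "account_type": "com.google", "data1": "Alice"}
print(pseudonymize(record))
```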

edit: In fact I'll do something better...

edit 2: Something like this works? contacts-2024-05-23-chirodacted.json

Some useful stuff:

key: account_name value: Meet -> "dunno_0807"
key: account_name value: Signal -> "dunno_0335"
key: account_name value: WhatsApp -> "dunno_0540"

Script: https://github.com/1Dragoon/chirodactor/

Basically it finds interesting fields and attempts to normalize them, then stores them in an ordered and deduped array, then inserts the order offset in its place along with a guess of what type of data it is. Not perfect, but it should be good enough to easily determine which contacts are related to each other.
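To illustrate the general idea (this is not the chirodactor code, just a minimal sketch under my own assumptions), the approach can be modeled as a table that maps each normalized value to the offset at which it was first seen. The class name, the normalization rule, the first-seen ordering, and the type-guessing heuristics below are all hypothetical.

```python
import re

class Pseudonymizer:
    """Maps each distinct normalized value to a stable label such as 'dunno_0003'."""

    def __init__(self):
        self._seen = {}  # normalized value -> first-seen offset

    @staticmethod
    def normalize(value):
        # Collapse whitespace and ignore case so near-identical values match.
        return " ".join(value.split()).casefold()

    @staticmethod
    def guess_kind(value):
        # Very rough guesses; anything unrecognized is labeled "dunno".
        if "@" in value:
            return "email"
        if re.fullmatch(r"\+?[\d\s().-]{5,}", value):
            return "phone"
        return "dunno"

    def label(self, value):
        norm = self.normalize(value)
        # setdefault keeps the existing offset for values seen before.
        offset = self._seen.setdefault(norm, len(self._seen))
        return f"{self.guess_kind(norm)}_{offset:04d}"

p = Pseudonymizer()
for raw in ["WhatsApp", "  whatsapp ", "+1 555 0100", "a@b.c"]:
    print(repr(raw), "->", p.label(raw))
```

Because equal values always get the same label, two entries that share a phone number or name remain visibly linked after redaction.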

Sinestr0 commented 6 days ago

@1Dragoon is this issue related to #68, or is it completely different?