julianharty / app-store-reviews-app

Android App to analyse reviews and scripts to load historical reviews data into a SQLite database.
Apache License 2.0
3 stars 5 forks source link

Dealing with duplicate imports of content #8

Open julianharty opened 7 years ago

julianharty commented 7 years ago

Currently the app blindly imports contents, even if it's already imported them before. I don't think it's useful to have duplicates of contents, so let's find a way to firstly detect duplicate content and then leave the existing content in-situ if we detect a duplicate is being ingested.

Of course, it'll be helpful to communicate dups were found. Quite how is to be determined as the most appropriate form of comms may depend on the number of dups found e.g. 1 is fairly practical to show in the GUI, many ain't.

julianharty commented 7 years ago

Identifying reviews uniquely is challenging based on the 4 mandatory fields in a review (2 of these reflect the same underlying data - the timestamp when the review was created, one as a formatted string, the other in milliseconds). The other 2 mandatory fields are the package name - certainly not unique and the star rating. The star rating can be changed by a reviewer when they update their review, so it isn't appropriate to use to determine uniqueness of a review. That means we need to get more sophisticated in the algorithm if we want to detect uniques and dups.

BTW: I quickly tested whether the timestamp was unique by adding a unique constraint on the combination of package_name and review timestamp in millis and then loaded a couple of files of reviews for Kiwix. I discovered 4 records were rejected in one of the 2 files. This file has 324 reviews.

Here's my code for the record sqlStatement.append("CONSTRAINT unique_review UNIQUE (package, review_submitted_millis)"); and a link to how to create a unique constraint in SQLite. https://www.techonthenet.com/sqlite/unique.php

julianharty commented 7 years ago

Several things: When I extended the uniqueness check to include the star_rating column the two main test files loaded without any duplicates. This isn't a definitive answer, I'll continue to find ways to uniquely identify reviews, possibly using a hash algorithm e.g. md5 or merkle trees.

From discussions with Damien Allison, one option is to treat updates to reviews as new reviews since they're distinct and - from the perspective of assessing their value - can be considered independently of any previous incarnation of the review. This may simplify the matching algorithm somewhat (important as the user can change the star rating so we can't use this as part of the matching algorithm).

We could reject files rather than individual reviews. This isn't perfect, but might be 'good enough'. Files can change, for instance someone could rename a file and currently it'd look like a different file (what'd be in common would be the size). I have considered calculating a hash for the complete contents of the file but that's hard to implement without significant work or processing the file's contents twice.

I notice the load is now taking a bit longer, around a second for 324 records on the Samsung Galaxy C I'm mainly using for testing. As I add code I may need to improve the UI and file processing logic.

julianharty commented 7 years ago

Another note on this thread, this time on detecting whether we're dealing with several records for a common review. The three mandatory values are not enough to make a record unique, furthermore one of these (the star rating) can be changed by the reviewer. Several of the other fields can also be changed I believe (but don't yet know for sure as judicious testing by writing and submitting reviews and updates needs to happen on Google Play) and then wait until I receive the updated reviews in the data fields to see how and where the revised information appears.

There's a data structure, a Merkle Tree https://brilliant.org/wiki/merkle-tree/ , that can create a combined hash of the fields in a data structure. Potentially this can be used to provide a signature of reviews and unambiguously identify identical reviews. At this stage I don't know if it's necessary to go to this level of checking and matching reviews so I'm writing this comment before I forget what I've learned.

Perhaps I could simply generate a CRC or Message Digest for each row in the file, however this would only work for file formatted input rather than reviews retrieved online using the respective Google Play API. Hmmmm. BTW: there's a Merkle Tree implementation in Java at https://github.com/quux00/merkle-tree which includes an example of calculating CRC32 in case I need that too/instead.

Here are a couple of additional links if I return to this topic http://java-lang-programming.com/en/articles/29 (Java implementation of using Merkle Trees for a bitcoin design). O'Reilly's Bitcoin and Blockchain e-book's section on this topic http://chimera.labs.oreilly.com/books/1234000001802/ch07.html

Generating an md5 hash http://stackoverflow.com/questions/415953/how-can-i-generate-an-md5-hash