jimmejardine / qiqqa-open-source

The open-sourced version of the award-winning Qiqqa research management tool for Windows
GNU General Public License v3.0

Dropbox synced library often loses bibtex/metadata #415

Open astaroph opened 1 year ago

astaroph commented 1 year ago

Hello,

I've been a big fan of Qiqqa since the commercial days and have really been enjoying all the new features, especially the return of the sync capability to keep my libraries harmonized between my different computers. However, I've been running into a worsening issue: I'll finally have waded through all of the bibtex sniffing (and the recurrent lockouts from Google) and gotten my library up to date, and then on my next sign-in (usually when I sign into the other device, and especially right after I have just synced), I will lose dozens or more hard-won bibtex records, for papers that clearly already had them (and that have even been part of an Expedition already; see the attached screenshot).

This has been occurring at some low level for a couple of years now, across several versions of v83. I have tried reverting to v82 and simply maintaining separate libraries on my desktop and laptop that watch the same Dropbox folder (I have to sniff each paper twice to keep them both up to date, but at least it is stable), but it would be very nice to be able to actually take advantage of the new sync capability. I know there is a lot going on under the hood and not enough of you all to go around, but any help here would be most appreciated! Happy to drop in my logs (extensive, when last I checked, and with various warnings and errors that may be helpful). Thanks!

GerHobbelt commented 1 year ago

OK, this is a "known issue" that's been bothering me for quite a while already. It's one of the reasons why I've decided to do a rewrite of the software, though that's mostly due to blocking problems in the first ~100 issues listed here at GitHub. Anyway, back to your observed problem:

What I would like you to try is this work-around (technology cause-and-effect elaboration later on in this writing):

Preface

  1. yes, it may seem rather nuts to have a LOCAL sync directory as that would mean we store the Qiqqa database + PDFs twice on a local disk. Let it be. Just don't ask and I shall tell ye no lies. :wink:
  2. Make a backup before you do this. YMMV. Your most important file is the Qiqqa.library database, as that's where all the metadata (bibTeX, etc.) is stored. Better safe than sorry.

The Action

  1. set qiqqa up to SYNC to a LOCAL directory.
  2. SYNC to that directory. This should copy your entire library to the sync directory.[^1]
  3. then copy those files by hand onto Dropbox (a small scripted sketch of this copy step is shown below, after the footnote).

The reverse process for the other machine(s):

  1. set the other qiqqa instance up to SYNC to (its own machine's) LOCAL directory.
  2. manually copy the files off DropBox into a similar LOCAL sync directory
  3. SYNC that qiqqa instance: if all goes well, it should see the updates and pull them in from the local SYNC dir.

[^1]: Note for very observant folks: the sync dir will have a different database file (.s3db vs. .library) and no Lucene.NET index directory, as that stuff is recreated on every instance.
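
If you'd rather script step 3 (and the reverse copy on the other machine) than drag files around by hand, here's a minimal sketch of that plain, sequential file copy plus a verify-after-copy check. The paths are made-up placeholders; point them at your own local sync dir and Dropbox folder.

```python
import hashlib
import shutil
from pathlib import Path

# Hypothetical example paths -- point these at your own local sync dir and Dropbox folder.
LOCAL_SYNC_DIR = Path(r"C:\Users\me\Documents\QiqqaSync")
DROPBOX_DIR = Path(r"C:\Users\me\Dropbox\QiqqaSync")

def file_hash(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def copy_and_verify(src_root: Path, dst_root: Path) -> None:
    """Plain, sequential file copy of the whole sync dir, with a verify-after-copy check."""
    for src in src_root.rglob("*"):
        if not src.is_file():
            continue
        dst = dst_root / src.relative_to(src_root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)                      # copies file data + timestamps
        if file_hash(src) != file_hash(dst):
            raise IOError(f"verify-after-copy failed for {src}")

# Step 3: library -> Dropbox. On the other machine, run the same thing in reverse.
copy_and_verify(LOCAL_SYNC_DIR, DROPBOX_DIR)
```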

Technological background info

Writing that in the next response msg.

GerHobbelt commented 1 year ago

Technological background info

Why does DropBox "nuke" our Qiqqa databases?

Well, that's rather vague and a very tough nut to crack, regrettably.

First there's this: Qiqqa uses the well-known SQLite database library (a .NET port/derivative, in fact, but the argument remains the same), and apparently DropBox, Google Drive and a few other cloud storage providers have a hard time dealing with SQLite databases while they are being read/written by SQLite, i.e. while these database files (Qiqqa.library in our case, BTW) are updated during the Qiqqa SYNC process, which is basically a large database sync action.

Why does this go wrong?

Frankly, I'm stumped. Theoretically this should not happen as SQLite uses regular random access APIs, so there's nothing fancy happening - or so you might think.

Actual reality contradicts this theory.

There are two things:

Tests thus far[^2] have led to these findings (which are rather fuzzy, as these tests are irritatingly unwilling to produce consistent results: one day you fare better or worse than another...):

[^2]: I have tried to dig out this problem, but most times when I design or run tests, DropBox seems to "notice" and quite evilly refuses to fail on me. Yes, I am anthropomorphizing the bugger. Satan is among us, or something crazy like that. Thus far, the collective cloud storage services (DropBox, Google Drive, MS OneDrive) have proven smarter than yours truly. They only fail when I don't watch for failures, or only trigger very rarely when I do, making me spend bloody ages on this. Sigh.

Meanwhile Qiqqa isn't helping either, as it has its own little nastiness built in since day one (see also https://github.com/GerHobbelt/qiqqa-revengin#database-file-format-as-discovered-via-db-inspection): Qiqqa, for some very odd reason (apparently they didn't trust their own software at some point, I guess), adds an MD5 hash checksum to every bibTeX metadata record stored in the database.

Fine, that check should always succeed, am I right? Yes, usually. Unless... you start hacking the database from outside Qiqqa (which I was doing back in the day when Qiqqa was still commercial), or someone does the hacking for you: enter the at-times-odd behaviour of DropBox & friends, which to Qiqqa looks like "record corruption" as SQLite (temporarily or permanently) produces a b0rked record for a given query. The measure taken in the software is to DELETE the metadata record. Ouch.

Okay, so what if we keep those records intact, i.e. take out the MD5 checksum check? Well, that has crossed my mind as well, and I tried it for a while too. But the corruption that DropBox manages to induce in SQLite databases is such that I then get spurious crashes and odd(er) behaviour elsewhere in Qiqqa, so that bright idea turns out to be... not so bright.
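
To make the mechanism concrete, here's a minimal sketch of the idea (the function names and exact hashing details are illustrative, not Qiqqa's actual record layout):

```python
import hashlib
from typing import Optional

def bibtex_checksum(bibtex: str) -> str:
    """Hypothetical per-record checksum, in the spirit of what's described above."""
    return hashlib.md5(bibtex.encode("utf-8")).hexdigest()

def read_record(stored_bibtex: str, stored_md5: str) -> Optional[str]:
    """Recompute the stored hash on read; a mismatch is treated as record corruption."""
    if bibtex_checksum(stored_bibtex) != stored_md5:
        # Qiqqa's (rather drastic) reaction to a mismatch is to DELETE the
        # metadata record -- which is how a transient DropBox/SQLite hiccup
        # can turn into permanently lost bibTeX.
        return None
    return stored_bibtex
```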

Is DropBox evil, then?

Well, insert a bit of handwaving here, but at least it works for what it's designed for, and that's file copying. Which is a sequential access pattern, rather than random access.

And that translates to...?

That means we MUST NOT use SQLite to execute the SYNC action onto DropBox et al; instead, that part of the "sync" action should be a regular "file copy" action, to make sure you're not toasted by DropBox et al. Which is what the work-around described in the previous message aims for.

For advanced folks, there's rclone and similar tools, which can take care of that "file copy to/from Dropbox" action. Those tools are focused on Linux, but there are Windows versions too. At some point, I hope to roll my own dedicated Qiqqa sync tool based on their technology, with extra built-in checks, because I've run into some other file copy/sync troubles along the way (independent of Qiqqa): it turns out that at least one of my machines sometimes, if only rarely, exhibits a CRC failure when writing data to certain magical combinations of USB disk tray and HDD. So a "file verify-after-copy" is in order, too. ... As if the situation wasn't complicated enough already. :stuck_out_tongue_winking_eye:
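
As a sketch of what such a wrapper could look like today, assuming rclone is installed and a remote named dropbox: has been set up via rclone config (both the remote name and the local path are placeholders): rclone copy does the sequential file transfer, and rclone check re-compares source and destination afterwards, which covers at least part of that verify-after-copy wish.

```python
import subprocess
from pathlib import Path

# Hypothetical setup: rclone installed, and a remote named "dropbox:" configured
# beforehand via `rclone config`. The local path is a placeholder too.
LOCAL_SYNC_DIR = Path(r"C:\Users\me\Documents\QiqqaSync")
REMOTE = "dropbox:QiqqaSync"

def rclone(*args: str) -> None:
    """Run one rclone command; raise if it exits with a non-zero status."""
    subprocess.run(["rclone", *args], check=True)

# Push the locally-synced library up as plain files (sequential copy; SQLite never
# touches the cloud folder), then let rclone re-compare source and destination.
rclone("copy", str(LOCAL_SYNC_DIR), REMOTE)
rclone("check", str(LOCAL_SYNC_DIR), REMOTE)
```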

Anyway, what we should aim for, technology-wise:

  1. Apply SQLite on local files only.

    Note: This is not a SQLite problem per se, by the way. All other databases suffer from the same issue. The Big Boys have come up with a different approach to solve the problem we are facing here, though: network-protocol-based access to the database instance. Which is off-limits for me & Qiqqa: I want to keep this a desktop application that works with whatever network access happens to be available, i.e. I hope to be/remain able to provide Qiqqa 'sync' functionality across the least common denominator of networked data sharing. Which is... DropBox et al. Yeah, painting oneself into a corner there. But that's what we need to do, when we want to provide non-tech-savvy users with an application that can "share" a library across multiple machines. :shrug:

  2. Apply other "sequential access" software / methods to copy those synced files across the given network barrier, e.g. rclone onto & off Dropbox/GDrive/OneDrive/Mega/...

Alas, the current state of affairs is that we haven't yet produced such a sync system, hence the manual work-around attempt in the previous message. :cry: My apologies....

If we dig into that concept a little deeper, we find several issues that need solving, not all of which are resolved by using 'rclone' as the answer:

  • given the finicky behaviour of DropBox et al when syncing files, there's a reasonable chance that a sync action can fail in various ways, including contention/collisions due to another user sharing the same files and syncing/updating them around the same time: rclone does not protect you from disaster there, so we need to check the integrity of the files we obtain from the cloud that way. Then we need to decide what to do when we discover such corruption (discard & keep the previous version, or refetch later?).
  • rsync and unison are tools well known in Linux Land when it comes to synchronizing machines across a network. However, those all assume a higher grade of "least common denominator re network access" than we do. Enter rclone, which addresses the same area of need as we have. (But it lacks a proper file content verify action, so there's still some chance of corruption getting through -- see my note about the USB bay + HDD observation above.)
  • multi-user Qiqqa is another question that I reckon with too: we need collision resolution when multiple people happen to edit/change records somehow, or rather: when such edits are discovered to collide at sync time. We are not 24/7 networked with Qiqqa (and I don't intend to be: it should work stand-alone, similar to git for software developers), so "simultaneously editing users" can easily become "edited the same data within the same 24-hour window" or such-like. There are automated collision resolution protocols out there (commercial Qiqqa used such an idea: last one wins), but I find my own metadata is precious enough that I don't accept the side-effect of those automatic algorithms, which is: sometimes the algorithm decides one of the incoming data chunks is discarded, which means you lose an edit permanently. Hence we'll need something like git mergetool style manual (and machine-assisted) merge/collision handling (see the sketch after this list). But that's for later this year....

    See also https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type & note that those approaches can potentially throw data away. Hence I'm of the opinion that we need a partially-manual process à la git mergetool.
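
To make the "git mergetool style" idea from the last bullet a bit more concrete, here's a minimal sketch of per-record three-way conflict detection. It assumes a common "base" version is kept around, which is exactly the data-versioning requirement discussed below; the function is hypothetical, not existing Qiqqa code.

```python
from typing import Optional, Tuple

def merge_record(base: Optional[str], local: Optional[str],
                 remote: Optional[str]) -> Tuple[Optional[str], bool]:
    """Three-way merge of a single metadata record.

    `base` is the last version both machines agreed on; `local` and `remote`
    are the two edited copies. Returns (merged_value, needs_manual_merge).
    """
    if local == remote:
        return local, False      # identical edits, or no edits at all
    if local == base:
        return remote, False     # only the remote side changed the record
    if remote == base:
        return local, False      # only the local side changed the record
    # Both sides changed the record in different ways. Instead of a last-one-wins
    # rule that silently discards one edit, flag it for manual (mergetool-style)
    # resolution.
    return None, True
```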

While I digressed into the area of collision resolution, CRDTs and git mergetool-like manual supervision approaches, these are sideways relevant to the issue at hand: a proper Qiqqa SYNC would actually need a sequential copy to/from the network (git does this, and so do rclone et al file copiers), while the local-to-local, mostly-SQLite-based sync should be manually supervisable to resolve "merge conflicts". The extra hair in the soup is that we then still need to make sure our local SYNC does not collide with someone else in the team doing a similar SYNC on their machine during the same time period. So this becomes a kind of iterative sync mechanism, with the additional conclusion that we need "data versioning" (for three-way merging tactics when we need merge/sync conflict resolution). Qiqqa currently does not do "versioning": it only keeps the last version of your metadata records, so we're looking at a database layout change as well before this sync function may work as one might hope/expect.

Anyway, that's the whole of it, including the "future music" part, i.e. the team usage discussed in this latter chunk.

astaroph commented 1 year ago

Well, I appear to have really stumbled into it, haven't I! Sorry to poke the wound, as what seemed a minor inconvenience to me has clearly been a true thorn in your side! Thank you for such a detailed and thorough answer. I have indeed been quietly scratching my head at some of the odd Dropbox-related occurrences I have seen pop up in the issues, and I can see that it may indeed be worthy of anthropomorphizing. I will try to implement the local sync with manual copy solution you suggest and report back.

However, at the risk of opening up another can of worms (and with none of the hard-won expertise in the relevant programming issues at play), I will also put out an idea I just had, based on what I believe I understand to be the issue (random access by SQLite to a "Dropbox" folder proper). It would seem that Dropbox has the ability not just to automatically sync its own folders, but also to automatically "back up" a local folder like Documents or Desktop. I'm going to wildly assume that this "back up" operation is treated differently from whatever black magic they do in their own folders, and hypothesize that if I tell Dropbox to "back up" a local sync folder that I place in "Documents" or something benign like that, then I will automatically be able to back up from said folder onto my other machine, to a similarly benign place that can also be watched. It may involve having two different online "backups" hosted by Dropbox, but if it's any easier and doesn't lead to corruption, maybe worth it? More once I've had the chance to test things, and thank you so much for your time, expertise and help!

astaroph commented 1 year ago

Though I suppose this may easily end up just as manual as copying and pasting if it requires manual downloads...

klmckinney commented 1 year ago

I believe that within the SQLite database there is just one table. If this is truly the case, is SQLite truly a necessity? Why use a relational database if you have no relations? Is it just being used as a clever storage manager? There might be a better tool and/or data structure for handling the data that is more suited to cloud/network storage.

I am happy to help with this issue but I have to point out that I am mostly baffled by the logic of what Qiqqa is doing!

GerHobbelt commented 1 year ago

@astaroph:

It would seem that Dropbox has the ability not just to automatically sync its own folders, but also to automatically "back up" a local folder like Documents or Desktop. I'm going to wildly assume that this "back up" operation is treated differently from whatever black magic they do in their own folders, and hypothesize that if I tell Dropbox to "back up" a local sync folder that I place in "Documents" or something benign like that

Certainly worth a try!

Keep in mind that what I explained is not proven; given my troubles (and apparent failure) to get these problems to show up in a testable/analyzable manner, we're reasoning based on conjecture, so be ready for "surprises".

Side note: it might be handy to copy the qiqqa.library database to a local backup once in a while: that would at least give us a way to compare databases and discover which metadata (bibtex) records got lost. (It'll take some time before I have tools to recover & re-import such raw backup data, but it's certainly on the list, as I have several backups like that myself from years past, thanks to Qiqqa continuously crashing with my (large) library -- that's happened ever since commercial v69; multiple layers of bugs...)
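
As a rough sketch of such a comparison (the table and column names here are assumptions based on the single-table layout described in the qiqqa-revengin notes; adjust them to the real schema):

```python
import sqlite3

# Hypothetical file names and schema: the table / key column names are assumptions
# (see the qiqqa-revengin notes linked earlier for the real layout) -- adjust as needed.
CURRENT_DB = "Qiqqa.library"
BACKUP_DB = "Qiqqa.library.backup"
TABLE = "LibraryItem"     # assumed name of the single metadata table
KEY = "fingerprint"       # assumed per-document key column

con = sqlite3.connect(CURRENT_DB)
con.execute("ATTACH DATABASE ? AS backup", (BACKUP_DB,))

# Records present in the backup but absent from the current library are the
# candidates for "hard-won bibTeX that silently went missing".
missing = con.execute(
    f"SELECT {KEY} FROM backup.{TABLE} EXCEPT SELECT {KEY} FROM main.{TABLE}"
).fetchall()

for (key,) in missing:
    print("lost record:", key)
con.close()
```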

GerHobbelt commented 1 year ago

@klmckinney:

I believe that within the SQLite database there is just one table. If this is truly the case, is SQLite truly a necessity? Why use a relational database if you have no relations? Is it just being used as a clever storage manager? There might be a better tool and/or data structure for handling the data that is more suited to cloud/network storage.

I am happy to help with this issue but I have to point out that I am mostly baffled by the logic of what Qiqqa is doing!

Thanks. :-) (We should set up a voice call soon, faster discussion that way.)

re SQLite: yup, 1 table. Jimmy apparently used SQLite as a glorified key-value store. Why, I don't know; I haven't talked to him, so I can only guess. (I have some assumptions, but actual history may be quite different from those guesses of mine.) Just a few comments for now, as I'm way overdue for bed (3:49 AM here; gotta work tomorrow, so ho-humm).
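
For anyone wondering what "glorified key-value store" means in practice, here's a minimal illustration of that usage pattern (deliberately not the literal Qiqqa schema; the qiqqa-revengin write-up linked earlier documents the real one):

```python
import sqlite3

# Illustrative only: a single-table, "glorified key-value store" usage of SQLite,
# roughly the pattern described above (not the literal Qiqqa schema).
con = sqlite3.connect("library.s3db")
con.execute(
    "CREATE TABLE IF NOT EXISTS item ("
    "  key  TEXT PRIMARY KEY,"    # e.g. a document fingerprint
    "  data TEXT"                 # e.g. the bibTeX / metadata payload
    ")"
)
con.execute(
    "INSERT OR REPLACE INTO item (key, data) VALUES (?, ?)",
    ("c0ffee", "@article{doe2020, title={...}}"),
)
con.commit()
row = con.execute("SELECT data FROM item WHERE key = ?", ("c0ffee",)).fetchone()
print(row)
con.close()
```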

GerHobbelt commented 1 year ago

Oh, and always keep in mind: all databases (including the key/value NoSQL ones) are engineered for local disk storage only. None of them have been engineered to sit on top of cloud storage. The way the "big boys" do it is to run multiple instances (phrases such as "sharding" or "partitioning" when you're more into Oracle et al) and demand a solid network setup[^bofh], so nothing flaky like consumer-level cloud storage access through arbitrary ISP hardware (NATs, routers[^1], etc.) and applications you have no real control over.

Meanwhile, Qiqqa wishes to work in just such an unkempt environment.

The short end of this stick: whatever database you pick, SQL or NoSQL, you can bet your bottom dollar you'll be facing similar or just different crap happening out there -- remember, Qiqqa is used by several folks who use it as a tool and are not "into computers for the sake of IT". :wink:

-- one of the selling points of "running in the cloud", I guess. Thing is, I don't wanna go there. --

[^bofh]: preferably a network with an admin on call, so you'll know it'll be rock-solid as BOFH doesn't want to be woken at night :wink:

[^1]: one example only tonight: when you've not been facing (Dutch) KPN Nexperia boxes at customer sites, like I have, then you still need to descend one level further into Hell. Those are "routers" that regularly crap out when you put some load through them (torrents are absolutely notorious for it, but no sweat: other somewhat heavy network access can "swamp" these buggers just fine and take out your network connections at random). Way different from your regular corporate router/switch setup, where all those databases, SQL and NoSQL, expect to find themselves. (And, yes, KPN says they now use better ones. Sure, honey. :rofl:)

wwweagle commented 1 year ago

A temporary/intermediate solution might be writing a new file (with a new name) every time the SQLite file updates. I got the impression that the SQLite-Dropbox pair creates these headaches because they both tend to work on tiny fractions within a file. If you force them to drop their cleverness, they may actually work. This defeats most of the point of using a DBMS in the first place, but it's like 3 lines of code, and "editing metadata is human effort that comes at great cost" ¯\_(ツ)_/¯
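
In the spirit of that suggestion, a tiny sketch (paths hypothetical): once SQLite has closed the database, each update is published to Dropbox as a brand-new, uniquely named file, so neither SQLite nor Dropbox ever rewrites a shared file in place. The other machine would then grab the newest snapshot and copy it into its own local sync dir.

```python
import shutil
import time
from pathlib import Path

# Hypothetical paths. The idea: never let Dropbox see a file SQLite is still
# writing to -- each update gets published as a brand-new, uniquely named copy.
LIBRARY_DB = Path(r"C:\Users\me\Documents\QiqqaSync\Qiqqa.library")
DROPBOX_DIR = Path(r"C:\Users\me\Dropbox\QiqqaSnapshots")

def publish_snapshot(db_path: Path, dropbox_dir: Path) -> Path:
    """Copy the (closed) database file to Dropbox under a fresh, timestamped name."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = dropbox_dir / f"{db_path.name}.{stamp}"
    shutil.copy2(db_path, target)
    return target

# Run this after the SQLite connection has been closed / the Qiqqa sync finished.
publish_snapshot(LIBRARY_DB, DROPBOX_DIR)
```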

klmckinney commented 1 year ago

Hi @GerHobbelt, did you see my direct email of May 4th about HEAD not building? Is it worth opening an issue here to discuss it?

astaroph commented 1 year ago

Just an update here: I implemented your solution, @GerHobbelt, setting the sync point of each machine to a local folder (in Documents), fixing all of the missing bibtex and whatnot in the source library, syncing, manually copying over to Dropbox, then updating the local sync point of the second machine with the Dropbox files. I just synced my second machine with this updated local backup and it found all the missing metadata! It didn't transfer over the Expedition swatches, but I can live with that. I'll update periodically here if anything seems to be going wrong. Many thanks for a workable solution, and I look forward to seeing how things develop going forward!