jvirkki / dupd

CLI utility to find duplicate files
http://www.virkki.com/dupd
GNU General Public License v3.0

Possibly add database schema to documentation #26

Open priyadarshan opened 4 years ago

priyadarshan commented 4 years ago

We have been using dupd for a while now, and we would like to study its innards a bit more.

Having perused the already excellent documentation, I could not find the schema used for the SQLite db. Would it be possible to add a page with some details?

jvirkki commented 4 years ago

dupd is strict about maintaining backward compatibility within major versions for all documented interfaces (basically, anything that is documented in the man page or under docs).

I have purposefully avoided documenting the schema because once documented it means I can't change it until the next major release. Although the schema has been fairly stable, it is relatively closely tied to implementation details which means I don't want to get stuck not being able to change it if I need to refactor some of the code.

Eventually I'd like to document at least parts of it, but that's at least a few years away.

A more maintainable way of obtaining the data is to implement access commands in dupd itself which export the desired data. That way I can change the schema whenever I need but keep the command output stable. For example the "dupd report" command is basically just a sorted dump of the duplicates table but you don't need to know the schema details to run it.

If you can file one (or more) tickets describing the data you'd like to extract from the db, I can look into providing access functions that do it in a schema-independent way.

priyadarshan commented 4 years ago

Thank you, I understand. When I was preparing the ticket I was thinking of mentioning the necessity of an API, but I did not want to make it complicated or more burdensome.

In my case, I would need to know all the available fields before I could formulate queries against them. Basically, I would need a way to discover what is underneath. An API would be ideal. Of course I could look at the SQLite db schema; I was just hoping to have a more "commented" version of it.
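For anyone who wants to do that kind of discovery in the meantime, here is a minimal sketch using only generic SQLite introspection from Python's standard library. Nothing in it is specific to dupd: the function name `describe_schema` is made up for illustration, and the database path is whatever file your dupd installation writes to. Since the schema is intentionally undocumented, treat whatever this prints as subject to change between releases.

```python
import sqlite3
from pathlib import Path


def describe_schema(db_path):
    """Return {table_name: [(column_name, declared_type), ...]} for a SQLite file.

    Hypothetical helper for illustration; it is not part of dupd.
    """
    # Open read-only so that inspecting the file can never modify it.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        # sqlite_master lists every object stored in the database file.
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        schema = {}
        for table in tables:
            # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
            quoted = table.replace('"', '""')
            columns = conn.execute(f'PRAGMA table_info("{quoted}")').fetchall()
            schema[table] = [(col[1], col[2]) for col in columns]
        return schema
    finally:
        conn.close()
```

Calling `describe_schema(path_to_db)` and printing the result gives the table and column names, but of course not the "commented" explanation of what each field means.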

jvirkki commented 4 years ago

There isn't that much in the db to be honest. The output of "dupd report" is basically it.

For duplicates, each entry is file_size:file_count:list of files, plus a timestamp. The cache db has one or more hashes (more than one if multiple algorithms are in use) for each file, but only for large files.

I assume you know about the --format option for report? This will give you the core info out of the db:

dupd report --format csv (or json)
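As a rough sketch of consuming that output from a script, the snippet below runs the report command and parses its JSON. It assumes `dupd` is on the PATH and that a scan has already been run; the helper name `load_report` is hypothetical, and because the layout of the JSON is not described in this thread, the code returns the parsed value as-is rather than assuming any particular keys.

```python
import json
import subprocess


def load_report(command=("dupd", "report", "--format", "json")):
    """Run a report command and return its stdout parsed as JSON.

    Hypothetical helper for illustration. The default command is the one
    mentioned in this thread; no structure is assumed for the result.
    """
    completed = subprocess.run(
        list(command),
        check=True,           # raise CalledProcessError on a non-zero exit
        capture_output=True,  # collect stdout and stderr instead of printing
        text=True,            # decode the output to str
    )
    return json.loads(completed.stdout)
```

Going through the command rather than the database keeps a script insulated from schema changes, which is the point of the stable command output described earlier.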

priyadarshan commented 4 years ago

dupd report --format csv

Thank you, I did read about it but somehow neglected to try it. That is quite useful.