Efficiently fetch attachments by path

ryanwwest commented 1 year ago

I have a path, and I want to read the corresponding Zotero attachment item (if one exists). I know that this can be done via:

        zot.add_parameters(itemType="attachment")
        attachments = zot.everything(zot.items())

and then looping through all items in attachments until one's path value matches mine. But this is inefficient, as it may require going through the entire library every time, which can be 10k+ attachments.

It would be nice if add_parameters could support other parameters like path and contentType, though that might not be the right function for these. I'd like to filter by path and other item types while fetching, not after fetching, to be more efficient and faster (though I don't know whether zot.add_parameters(itemType="attachment") does what I'm describing or just runs a less efficient for loop as well).

Is there a way to improve this as is? If not, I wonder if this could be a FR.

urschrei commented 1 year ago

What do you mean by "path"? Can you provide a concrete minimal example?

ryanwwest commented 1 year ago

Sure. My Linked Attachment Base Directory is /Users/rw/s/zk/docs, and say I have a file paper.pdf within it. Then a 'path' (or, more aptly named, filepath) as I mention above would be /Users/rw/s/zk/docs/paper.pdf. I want to identify, if it exists, the Zotero item of type 'attachment' that corresponds to this filepath.

I'm hoping for a way to filter so that 'path', made relative to the Linked Attachment Base Directory (just 'paper.pdf' here since it's in the top-level folder), can be matched against all attachment items that have 'linkMode' set to 'linked_file' and thus have the required attribute path, which I can compare to 'paper.pdf'.

I have a working solution currently for this that uses the snippet of code I attached earlier, but it still takes ~4-5 seconds to fetch all attachments from a ~250 record Zotero database (on a 2021 MacBook Pro M1 Max). It seems like filtering to only return items that match this path would be best, though I'm not sure if that's possible or makes sense to implement.
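For reference, the client-side match described above could look roughly like the sketch below. It works on item dicts that have already been fetched (for example with zot.everything(zot.items())) and makes no network calls. The helper names are hypothetical, and the "attachments:" prefix for files under the base directory is an assumption about how the path is stored, so the sketch also accepts the absolute filepath.

```python
from pathlib import PurePosixPath

# Assumption: linked files under the base directory are stored with this prefix.
BASE_PREFIX = "attachments:"


def candidate_paths(filepath, base_dir):
    """Return the path values a matching linked_file attachment might carry."""
    candidates = {filepath}
    try:
        relative = PurePosixPath(filepath).relative_to(PurePosixPath(base_dir))
    except ValueError:
        # The file is outside the base directory: only the absolute path applies.
        return candidates
    candidates.add(BASE_PREFIX + relative.as_posix())
    return candidates


def find_linked_attachment(items, filepath, base_dir):
    """Return the first linked_file attachment whose path matches, else None."""
    wanted = candidate_paths(filepath, base_dir)
    for item in items:
        data = item.get("data", {})
        if data.get("linkMode") == "linked_file" and data.get("path") in wanted:
            return item
    return None
```

This is still a linear scan over everything fetched, so it only tidies the loop; it does not remove the cost of retrieving the attachments in the first place.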

urschrei commented 1 year ago

I would imagine that something like truepath/zoterosync might allow you to be a bit more efficient by building a cache to search – the Zotero API doesn't allow you to search for a path like that as far as I know.

ryanwwest commented 1 year ago

Thanks for the suggestion. I looked up the repo and you're right that it may provide a good caching solution, but in my case data goes stale very quickly so caching might not work. I will consider this, though.

About that—I'm trying to do this partially to watch for any changes in parts of the Zotero DB. I'm specifically using the attachments found above to monitor the set of items of type 'annotation' that the found attachments are parents to. More generally though, do you know if there's a way to monitor a set of items for changes and get updates if so?

My plan is to just cache a copy of all the annotations and watch for changes, or maybe do something with hashes if that's too much space, but I thought I'd ask if someone's built a better way.
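A minimal sketch of the hash variant of that plan, assuming each annotation is a dict with a "key" and a JSON-serializable "data" field, as in the item dicts fetched earlier in this thread. The function names are hypothetical; only a SHA-256 digest per annotation is kept between runs, not the full copy.

```python
import hashlib
import json


def fingerprint(item):
    """Hash an item's data with a stable key order so equal data hashes equally."""
    payload = json.dumps(item["data"], sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def snapshot(items):
    """Map each item key to the fingerprint of its current data."""
    return {item["key"]: fingerprint(item) for item in items}


def diff_snapshots(old, new):
    """Return (added, removed, changed) item keys between two snapshots."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(k for k in new.keys() & old.keys() if new[k] != old[k])
    return added, removed, changed
```

The snapshot still has to be rebuilt from a full fetch each time, so this saves storage but not request time.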

urschrei commented 1 year ago

Locally? There's no way of doing that using Pyzotero since it only interacts with the (remote) web API. Zotero has a feature in the works which will provide the same API for your local library, but I have no idea when it's coming. Either way, you can't and (I assume) won't be able to monitor a path – that's an implementation detail. I suggest you think about the problem in terms of collections, items and their attachments.

That brings us to requesting only updates / changes since a given version of your library when retrieving items / collections / attachments, and that will in most cases be a pretty lightweight call so you can do it often (that's how the native sync works, and how the library I linked to works):

https://www.zotero.org/support/dev/web_api/v3/syncing

https://pyzotero.readthedocs.io/en/latest/index.html?highlight=Version#retrieving-version-information

https://pyzotero.readthedocs.io/en/latest/index.html?highlight=Version#zotero.Zotero.last_modified_version

The since=[version] argument when retrieving items is key here.
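As a rough sketch of that polling pattern, assuming a pyzotero client zot and a library version saved from the previous run: fetch_changes is a hypothetical helper name, and deletions are not covered here because a since query only returns items that still exist.

```python
def fetch_changes(zot, since_version, **filters):
    """Return (changed_items, new_version) for everything modified after since_version.

    zot is expected to be a pyzotero client, e.g.
    zotero.Zotero(library_id, library_type, api_key).
    """
    # Read the library version first, so an edit made while we are fetching
    # is picked up again on the next poll instead of being missed.
    current_version = zot.last_modified_version()
    if current_version == since_version:
        # Nothing changed: skip the item request entirely.
        return [], since_version
    changed = zot.everything(zot.items(since=since_version, **filters))
    return changed, current_version
```

A caller would store the returned version and pass it back in on the next poll, for example fetch_changes(zot, saved_version, itemType="annotation").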

ryanwwest commented 1 year ago

Thank you for pointing out version - I've been keeping that field but assumed that it referred to the spec version, not the version of the modified contents. That's just what I need to detect the changes.

I'll check out the other links you mentioned as well. If I don't need to do local caching to monitor changes, that would be great.

The biggest hangup is that it takes a long time to fetch data using pyzotero. For example, after measuring self.zot.everything(self.zot.items(itemType="annotation")), it takes 3.36 seconds to fetch 88 items of type 'annotation', and I can see the 88 number being in the thousands for some libraries. If I can avoid this call using something from Zotero itself, that would be great, but I still use pyzotero to fetch attachments as well, which also takes a long time. Do you know if there are any ways to improve speed?

urschrei commented 1 year ago

That seems pretty slow (I can generally retrieve hundreds of items in a few seconds), but there are a lot of variables (I assume annotations are pretty small but perhaps not) involved including how busy the servers are, your own machine etc – Pyzotero is ultimately a thin wrapper around requests, so it's not introducing any real overhead.

You could try things like making one call per collection using a connection pool (you might get rate limited, but Pyzotero respects backoff signals from the API) and then combining all the results, but it's not something I've ever had to worry about.
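A minimal sketch of the one-call-per-collection idea using a thread pool, not something tested against a real library. make_client is a hypothetical factory that returns a fresh pyzotero client; the assumption here is that one client per task is safer than sharing a single instance across threads, since the client keeps request parameters as instance state. Items that belong to no collection are not returned by this approach.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all_by_collection(make_client, max_workers=4):
    """Fetch items collection by collection in parallel and merge them by item key."""
    first = make_client()
    collection_keys = [c["key"] for c in first.everything(first.collections())]

    def fetch_one(collection_key):
        # A fresh client per task, so no request state is shared between threads.
        zot = make_client()
        return zot.everything(zot.collection_items(collection_key))

    merged = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for items in pool.map(fetch_one, collection_keys):
            for item in items:
                # An item can sit in several collections; keep one copy per key.
                merged[item["key"]] = item
    return list(merged.values())
```

Keeping max_workers small seems prudent given the rate limiting mentioned above.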

ryanwwest commented 1 year ago

Thanks for the insights! Using self.zot.item_versions(itemType="annotation") instead is much faster (0.3 seconds for 88 items) and this comprises most of my calls, but good to know that the time delay from larger calls probably isn't on pyzotero's end. Annotation items are pretty small but definitely bigger than this.
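For anyone finding this later, a rough sketch of how the item_versions result can drive the larger fetch. It assumes item_versions returns a mapping of item key to version, and that the changed items are then requested by key in batches (the batch size of 50 is an assumption based on the web API's itemKey limit). The helper names are hypothetical.

```python
def changed_keys(cached_versions, current_versions):
    """Return (changed_or_new, removed) keys between two key-to-version mappings."""
    changed = sorted(
        key
        for key, version in current_versions.items()
        if cached_versions.get(key) != version
    )
    removed = sorted(cached_versions.keys() - current_versions.keys())
    return changed, removed


def key_batches(keys, size=50):
    """Yield comma-separated key strings, each holding at most size keys."""
    for start in range(0, len(keys), size):
        yield ",".join(keys[start:start + size])
```

Each batch could then be passed along the lines of zot.items(itemKey=batch), so the full annotation data is only downloaded for items whose version moved.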

Since Pyzotero isn't likely to have a way to filter attachment fetches by filepath, I think we can close this.

urschrei / pyzotero

Efficiently fetch attachments by path #164