LeoQuote opened this issue 1 year ago
verify was never tested on anything other than standard POSIX file systems. For non-POSIX, I'm open to many ideas.
I'm happy to add checkpointing support to verify as a temporary workaround (so it can resume when it's interrupted).
Now, in regards to your idea:
Python packages and their naming + metadata aren't as strictly enforced as one would like, so making guesses is going to land you in a bad place.
Now that we have the simple API in JSON, maybe pull that index file, then work through each project from that list using the success/failed checkpoint feature I alluded to above. Due to the size of the mirrors, every time I've thought about this it's been hard to do efficiently without adding more metadata to PyPI itself. But I'm happy to be proven wrong.
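For reference, pulling that project list via the JSON simple API (PEP 691) could look roughly like this; the content type and the `projects` key come from the spec, while the helper names and the default URL are just illustrative:

```python
import json
from urllib.request import Request, urlopen

# Content type for the JSON simple index, per PEP 691.
SIMPLE_JSON = "application/vnd.pypi.simple.v1+json"

def parse_project_names(payload: dict) -> list:
    """Extract project names from a PEP 691 index response."""
    return [p["name"] for p in payload["projects"]]

def fetch_project_names(index_url: str = "https://pypi.org/simple/") -> list:
    """Download the full project list from the JSON simple index."""
    req = Request(index_url, headers={"Accept": SIMPLE_JSON})
    with urlopen(req) as resp:
        return parse_project_names(json.load(resp))
```

Each name from that list could then be checked off against the success/failed checkpoint store as it is processed.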
I have not found any clear MUST for project name and package name, so I think you're probably right about the "guessing": we should not do that.
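For what it's worth, the closest thing to a MUST is PEP 503's name normalization, under which `bigdl_nano` and `bigdl-nano` refer to the same project:

```python
import re

def canonicalize_name(name: str) -> str:
    """PEP 503 normalization: lowercase the name and collapse every run of
    '-', '_' and '.' into a single '-'. Distinct projects on an index must
    differ under this rule."""
    return re.sub(r"[-_.]+", "-", name).lower()
```

Wheel filenames escape the distribution name by replacing those same runs with `_`, so canonicalizing the first filename segment recovers the canonical project name for spec-compliant files; files that predate or ignore that escaping are exactly the risk being discussed here.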
So about the checkpoint part, my thoughts are as follows:
We can have persistent storage in both stage 1 and stage 3, storing the package list and project list in text files, one project/package per line; if there are any properties (active, inactive, update time, etc.), separate them with commas, or just use CSV to keep it simple.
During stages 1 and 3, there's also a checkpoint to indicate the next package. My suggestion would be to start a new file every 10,000 projects/packages, so an interruption would lose at most about 10,000 projects of progress. The file structure would look like:
```
verify_data/
├── designated_package
│   ├── package_list.csv.1
│   └── package_list.csv.2
├── package
│   ├── package_list.csv.1
│   └── package_list.csv.2
└── project
    ├── project_list.csv.1
    └── project_list.csv.2
```
Stages 2 and 4 are similar to each other: read the files generated in the previous stage, load them into memory (a dict, for O(1) lookups), and compare.
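The comparison itself could look like this, with the designated list loaded into a set for O(1) membership tests (the function name and CSV shape are illustrative):

```python
def schedule_deletions(designated_csv_lines, actual_names):
    """Return entries present in `actual_names` but absent from the designated
    list (deletion candidates). The designated CSV lines are 'name,props...';
    only the name column is used for membership."""
    allowed = {line.split(",")[0] for line in designated_csv_lines if line.strip()}
    return [name for name in actual_names if name not in allowed]
```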
If verify is interrupted, first determine which stage we were in; a separate verify_status.json file can record the current stage.
Then read the last line of the last package_list.csv or project_list.csv shard and continue from there: iterate the source from the beginning, skipping entries until we reach the checkpoint.
If we were interrupted during stage 2 or stage 4, that's fine; these two stages are both fast. Just read the designated list and the actual list, compare them, and schedule deletions.
My other thought is using tables in an SQLite database for the state, committing as you go with something like aiosqlite, moving items from a simple todo table to a done table. Or just use a run id until finished. But I'm open to ideas.
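For illustration with the stdlib `sqlite3` module (aiosqlite mirrors this API asynchronously), a minimal todo/done state store might look like this; the table and function names are made up:

```python
import sqlite3

def init_state(db_path: str) -> sqlite3.Connection:
    """Create the todo/done tables if they don't exist yet."""
    conn = sqlite3.connect(db_path)
    conn.executescript(
        "CREATE TABLE IF NOT EXISTS todo (name TEXT PRIMARY KEY);"
        "CREATE TABLE IF NOT EXISTS done (name TEXT PRIMARY KEY);"
    )
    return conn

def mark_done(conn: sqlite3.Connection, name: str) -> None:
    """Move one package from todo to done. The `with conn:` block commits
    per item, so a crash loses at most the package in flight."""
    with conn:
        conn.execute("DELETE FROM todo WHERE name = ?", (name,))
        conn.execute("INSERT OR IGNORE INTO done (name) VALUES (?)", (name,))
```

On restart, whatever is left in `todo` is exactly the remaining work, with no log replay needed.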
The process seems sane. How do you want to structure the code? Refactor verify.py? I've always wanted to merge more of it into other classes and use the storage plugins directly for POSIX filesystems vs. other storage. Let's get agreement there before you get stuck into writing it. I'm happy to chat on Discord too if you want to talk it out in more real time.
I think all the things we're doing here amount to making the global info shareable across all stages of the verify process.
If the project accepts an external database, or an embedded database in memory, I'd recommend those options.
The database file must be placed on a POSIX filesystem, as SQLite/RocksDB on S3 would be inefficient.
About the code structure, I don't have many thoughts right now; maybe I'll create a demo first.
😄 I tried to run verify on my host and it failed: it took 7 days and 20 GB+ of memory before finally being OOM-killed by the kernel.
There are a few problems that I want to solve: packages may have changed during such a long run, so we should consider scanning packages first to make sure no file is deleted wrongly.
We've been using bandersnatch for years. Recently, I found it's hard to actually run `bandersnatch verify`, as we're using S3 as storage and it already contains millions of packages. Loading the JSON files alone would take a day or longer, and what makes it worse is that if this process is interrupted (system reboot, container killed, both fairly common), all progress is lost and you need a fresh start.
So I'm thinking about another way to verify packages.
My idea: iterate over the packages and verify whether each one should exist.

- List package files with the `list_objects_v2` API, 1000 package files at a time.
- Guess the project name from the package filename: `bigdl_nano-2.2.0-py3-none-macosx_10_11_x86_64.whl` may belong to `bigdl_nano` or `bigdl-nano`.
- Look up the guessed project's info in the local files; if the package file isn't supposed to exist there, it should be deleted.

The good part about this method is that it can be continued if interrupted: with `list_objects_v2` you can use a `ContinuationToken` to resume the iteration, and the memory usage is fairly low compared with storing all package info in memory.

The bad part is: how can we guess the project name? What if a package name does not follow the common rule, like a project named `project-1` that contains `project_2.whl`? In this case, we would never locate the correct project, and the package file would be cleaned up wrongly. It would be great if someone could tell me that this can never happen, e.g. that it's enforced in some way, or is an obsolete pattern that would never happen again.
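The iteration step can be sketched as a resumable generator. It assumes a boto3-style client (only `list_objects_v2` is called, so any object with that method works); the bucket/prefix values are illustrative:

```python
def iter_package_keys(client, bucket, prefix="", start_token=None):
    """Page through a bucket with list_objects_v2 (up to 1000 keys per call),
    yielding (key, page_token) pairs. Persist page_token as the checkpoint:
    resuming with it re-lists only the interrupted page."""
    token = start_token
    while True:
        kwargs = {"Bucket": bucket, "Prefix": prefix, "MaxKeys": 1000}
        if token:
            kwargs["ContinuationToken"] = token
        resp = client.list_objects_v2(**kwargs)
        for obj in resp.get("Contents", []):
            yield obj["Key"], token
        if not resp.get("IsTruncated"):
            return
        token = resp["NextContinuationToken"]
```

Memory stays bounded by a single page of results, which is what makes this attractive compared with loading all package info up front.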