vatlab / sos

SoS workflow system for daily data analysis
http://vatlab.github.io/sos-docs
BSD 3-Clause "New" or "Revised" License

Re-organization of signature files. #1004

Closed BoPeng closed 6 years ago

BoPeng commented 6 years ago

Currently we basically create a signature file for each input and output file. Most of them are under .sos/.runtime, but some global files have signatures under ~/.sos/.runtime. The signatures contain file size, last modified date, and md5 checksum, and are used to track whether a file has been changed. The signature files are shared by workflows in the sense that different workflows have access to the signatures of the same files. The signatures can be removed with the command sos remove -s.
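For reference, a signature of this kind boils down to a few fields per file; a rough sketch of how such a record could be computed (illustrative only, not the actual SoS code, which lives in sos/targets.py):

```python
import hashlib
import os

def file_signature(path):
    """Collect the fields stored in a file signature:
    full path, last modified time, size, and md5 checksum."""
    st = os.stat(path)
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            md5.update(chunk)
    return os.path.abspath(path), st.st_mtime, st.st_size, md5.hexdigest()
```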

The problem with this approach is that the number of small signature files accumulates over time, because most users simply wrap up the project without performing sos remove -s or rm -rf .sos. Also, for projects that deal with a lot of input and output files, the time to handle signatures can exceed the time to process the files. It has been suggested that we use timestamps instead of checksums, but the problem of saving signatures remains.

It is possible to use a database to properly organize the signatures. However, there can be multiple sos instances, and sos itself runs in a multi-processing mode, so the frequent locking and unlocking needed to safely read/write such a file-based database (e.g. sqlite) can be troublesome. There are several solutions:

  1. Let processes lock/unlock the database themselves, which might not be too bad.
  2. Use a dedicated process (e.g. the sos master) to read/write to the database. For example, the master sos process can create a thread that answers on a port, essentially acting as a server.
  3. Use a dedicated server such as redis. The master process would start a redis server and listen to requests from the workers.

Note that we have not considered a server-based approach before because our tasks use file-based signatures, and they can be executed on remote hosts that might not be able to communicate with the (redis) server. However, with the new in-memory signatures specifically designed for tasks (in the new format branch), we are in a much better position to handle the signature problem, and the remaining issues are mostly technical. That is to say,

  1. how to share the same server among multiple sos instances.
  2. how to design a policy to remove "unused" signatures.
  3. if there is a need to bundle signatures with a project (the now-defunct sos pack command), and how to do it if there is such a need.
  4. how to handle sos notebook, where forking a server process can be problematic.
gaow commented 6 years ago

Yes, a proper solution to this problem would be very helpful for large-scale analysis. But before diving into details: 1) how does nextflow cope with it? 2) a database solution would help with the number of files generated, but not the size of signatures (for large global variables, which we may worry about later); and 3) if we have a single centralized database, it can get large over time in terms of file size and number of rows -- I guess this is why you said earlier that you believe the filesystem is the most efficient database?

I can appreciate the challenges of using databases that you've listed. Another high-level concern is that even if we can resolve all the problems one way or another, how robust and efficient do you think it's going to be?

A minor comment on current status:

because most users simply wrap up the project without performing sos remove -s or rm -rf .sos.

Even when they do, given that files are stored in ~/.sos, we cannot easily rm -rf the correct subset. tag is useful, but I've not used sos purge with tags (and am not sure how it compares with rm -f in terms of performance) ... we did not have it documented.

I think with the new_format branch, as long as we can easily clean up after ourselves (or do it somewhat automatically), we should be in a much better situation than before.

BoPeng commented 6 years ago
  1. do not know. It is not written in java so it is a bit difficult for me to check.
  2. There are two types of signatures; one is the step signature with the large variables etc. I do not plan to change it now because step signatures are "larger" and arguably fewer in number. The signatures I am talking about in this ticket are strictly file signatures that track changes to files. They are small and have a fixed format.
  3. A filesystem-based "database" is easier to work with because you can use filesystem commands. In this case, all we save are file signatures (filename, time, size, md5), so the database will not grow very big even with millions of records. This is substantially different from millions of small files, and there is less of a need to clean them up.
  4. Right now the efficiency depends on file system speed, which varies a lot. The database solution might be slower, but at least redis is known to be fast for our "regularly" shaped data.
  5. Tags are for tasks; they are now out of scope with the new format, which stores signatures with the tasks. sos remove -s is supposed to remove project-related signatures, but it might miss some files if the step signatures are missing. Overall sos remove -s is not an efficient approach.
BoPeng commented 6 years ago

Just for testing, I am writing signature files for all files in my anaconda directory.
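The test script is essentially the following (a reconstruction based on the traceback shown further down, not the exact script; the anaconda path is an example):

```python
import os
import time

from sos.targets import file_target  # write_sig() as seen in the traceback below

start = time.time()
count = 0
for root, dirs, files in os.walk(os.path.expanduser('~/anaconda3')):
    for file in files:
        # write a signature (path, mtime, size, md5) for every file
        file_target(os.path.join(root, file)).write_sig()
        count += 1
        if count % 1000 == 0:
            print(f'{count}  {time.time() - start:.1f}')
```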

On a Mac Pro with an SSD drive:

$ python test.py
1000  1.2
2000  2.5
3000  3.8
4000  5.1
5000  6.5
6000  7.9

164000  252.6
165000  254.2
166000  255.9
167000  257.7
168000  263.7
169000  273.9
170000  291.2

which works out to about 1.3 seconds per 1000 files at the beginning and 1.7 seconds per 1000 at around 168k files. Some batches take up to 10 seconds, perhaps because of file size. Overall, 178k files took 358.6 seconds (about 6 minutes).

On a cluster with NFS storage, I am seeing:

$ python test.py
1000  7.0
2000  13.0
3000  18.7
4000  24.7
5000  29.5
6000  36.3

28000  182.6
29000  188.5
Traceback (most recent call last):
  File "test.py", line 10, in <module>
    file_target(os.path.join(root, file)).write_sig()
  File "/scratch/bcb/bpeng1/anaconda3/lib/python3.6/site-packages/sos/targets.py", line 634, in write_sig
    f'{self.fullname()}\t{os.path.getmtime(self)}\t{os.path.getsize(self)}\t{self.target_signature()}\n')
  File "/scratch/bcb/bpeng1/anaconda3/lib/python3.6/genericpath.py", line 55, in getmtime
    return os.stat(filename).st_mtime
FileNotFoundError: [Errno 2] No such file or directory: '/scratch/bcb/bpeng1/anaconda3/pkgs/matplotlib-2.1.0-py36hba5de38_0/lib/libtk.so'

It crashes at around 29k files, but the problem is with reading the file, not with writing the signature file.

gaow commented 6 years ago

I assume file size also matters to the benchmark? 29K files in 3 minutes is acceptable -- does it mean that for an analysis involving all 30K genes it only needs 3 minutes to check signatures and determine whether to rerun or not?

BoPeng commented 6 years ago

This is only for saving signatures; I have not checked the performance of checking them. Anaconda files are generally small, so the calculation of md5 should be fast.

gaow commented 6 years ago

I see. I guess what matters to user experience is the signature check / skip when they rerun: suppose out of 30K jobs only 100 failed. In order to resubmit those 100, it is acceptable to wait maybe around 5 min for SoS to figure out what to resubmit. Saving a signature at the end of a running task is trivial compared to the computation itself.

gaow commented 6 years ago

How do we handle large global variables? We save their md5sum, right?

I'm currently halfway through running a cluster job. It has generated 1067 .task files in the ~/.sos folder with a total size of 32M. So 100K files would be about 3.2G, which perhaps is not too bad.

BoPeng commented 6 years ago

Let us do some benchmarking first. We could check only file size and mtime when verifying a signature, and use md5 only when the file is accessed. We do not know, however, how fast it is to retrieve mtime and size from a signature file.

BoPeng commented 6 years ago

Using a very dumb sqlite method that takes a full process lock for each write operation: each time it opens the database, inserts or replaces the signature, commits, and closes it.
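Roughly, this "dumb" version does something like the following for every single file (a sketch with a hypothetical table layout; the inter-process lock around each call is not shown):

```python
import sqlite3

def write_signature(db_file, path, mtime, size, md5):
    # open, insert-or-replace, commit, and close for every signature --
    # simple and safe, but pays the connection and commit cost on each write
    conn = sqlite3.connect(db_file)
    try:
        conn.execute(
            'CREATE TABLE IF NOT EXISTS signature '
            '(path TEXT PRIMARY KEY, mtime REAL, size INTEGER, md5 TEXT)')
        conn.execute(
            'INSERT OR REPLACE INTO signature VALUES (?, ?, ?, ?)',
            (path, mtime, size, md5))
        conn.commit()
    finally:
        conn.close()
```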

$ python test.py
1000  7.2
2000  11.9
3000  16.8
4000  21.7
5000  26.4
6000  30.6
7000  34.7
8000  39.0
9000  43.5
10000  48.1
...
173000  688.0
174000  691.5
175000  695.2
176000  698.8
177000  702.6
178000  706.2
Writing 178819 signatures took  709.2 seconds

So overall it is only about 2 times slower for 178k files. This is on an SSD with supposedly fast random access. The database size is 43M.

On NFS I have (still going)

$ python test.py
1000  16.1
2000  31.5
3000  46.3
4000  61.7
5000  75.8
6000  92.2
7000  110.8
8000  127.5
9000  143.3

24000  380.5
25000  397.4
26000  413.7
27000  429.2
28000  445.1
29000  460.2

It also crashes at that file, so the file might be strange (a link, etc.). The performance is about 2 times worse than the file-based version, and about 4 times worse than on SSD.

BoPeng commented 6 years ago

The task signatures are currently pickled and compressed in the .task file, so it is a lot smaller than before, especially for large variables.

gaow commented 6 years ago

so it is a lot smaller than before especially for large variables.

Sometimes a variable can be really large: e.g. a global variable holding a large data matrix. It would help if we keep only a partial md5sum of it.

So overall it is only about 2 times slower for 178k files.

Not bad! But this is on SSD and our previous experience suggests much slower performance on network file systems.

The query for a task signature is really simple, right? Basically just dictionary-like key-value lookup? Are there more efficient database implementations for this specific need?

BoPeng commented 6 years ago

Sometimes a variable can be really large: e.g. a global variable holding a large data matrix. It would help if we keep only a partial md5sum of it.

We do not do md5 on variables. While we could keep only a signature of input variables, output variables are returned for skipped steps, so all of their information has to be saved.

gaow commented 6 years ago

While we could keep only a signature of input variables

I think a signature of large input variables makes sense. I'm worried about those big matrices in my global section.

RE database -- A quick search shows this project:

http://leveldb.org/

and some benchmarks:

http://www.lmdb.tech/bench/microbench/benchmark.html

and a derivative:

https://github.com/facebook/rocksdb

LMDB also seems relevant: https://lmdb.readthedocs.io/en/release. Some argue it is better than levelDB, but at least its compression is worse: https://www.influxdata.com/blog/benchmarking-leveldb-vs-rocksdb-vs-hyperleveldb-vs-lmdb-performance-for-influxdb/

BoPeng commented 6 years ago

OK, here are the numbers if I do not open/close the database and force a write each time.
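That is, keeping one connection open and not committing on every insert; the difference is roughly (same hypothetical table layout as in the sketch above):

```python
import sqlite3

conn = sqlite3.connect('signatures.db')
conn.execute('CREATE TABLE IF NOT EXISTS signature '
             '(path TEXT PRIMARY KEY, mtime REAL, size INTEGER, md5 TEXT)')

def write_signature(path, mtime, size, md5):
    # reuse the open connection; sqlite batches the writes until commit
    conn.execute('INSERT OR REPLACE INTO signature VALUES (?, ?, ?, ?)',
                 (path, mtime, size, md5))

# ... loop over all files, calling write_signature() ...
conn.commit()   # a single commit (or periodic commits) instead of one per file
conn.close()
```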

$ python test.py
1000  0.3
2000  0.7
3000  1.0
176000  109.0
177000  109.6
178000  110.0
Writing 178819 signatures took  110.2 seconds

Without calculation of md5 (only saving date and size):

177000  18.2
178000  18.2
Writing 178819 signatures took  18.3 seconds

NFS:

$ python test.py
1000  2.7
2000  4.7
3000  6.8

20000  59.2
21000  63.6
22000  69.2
23000  74.1
24000  78.3
25000  83.5
26000  88.5
27000  92.8
28000  97.0
29000  100.7

without md5

15000  9.0
16000  10.0
17000  10.9
18000  11.7
19000  12.5
20000  13.4
21000  14.3
22000  15.7
23000  16.8
24000  17.5
25000  18.6
26000  19.6
27000  20.4
28000  21.2
29000  21.8
BoPeng commented 6 years ago

My reservation with leveldb is that it requires an extra dependency and my attempt to install plyvel just failed on mac.

So I think with some careful implementation, the sqlite version should work at least as fast as the file version. By careful implementation I mean not opening/closing the database for each signature check. This basically means that database access should be carefully coordinated among the workers.

gaow commented 6 years ago

my attempt to install plyvel just failed on mac.

This is too bad. My Linux installation worked instantly ...

My concern with sqlite is still related to its size. As far as I can recall we had problems in vtools when the file got large. Deleting old tasks will not shrink the size of a sqlite file. Not sure about leveldb, but at least it uses some compression. On the other hand, maybe we've already compressed enough that additional compression would not matter?

So I think with some careful implementation, the sqlite version should work at least as fast as the file version.

Looks like it is faster than the file-based version now?

BoPeng commented 6 years ago

We are still talking strictly about file signatures with a fixed structure. If 160k files take 50M, I do not see a problem with file size.

gaow commented 6 years ago

If 160k files take 50M, I do not see a problem with file size.

I'll see if that is the case without replacing my large global variable with the md5sum.

gaow commented 6 years ago

Okay, so far I've got 4759 task files in ~/.sos/tasks. The total size is 258M. This is because the large global variable parameter is saved in every step as-is, not as an md5sum. Should I open a separate ticket to discuss using md5sum for non-atomic variables?

BoPeng commented 6 years ago

Yes.


gaow commented 6 years ago

I did not get a chance earlier today to read more on sqlite4. I just read it more carefully. Its documentation claims it is faster than levelDB. The Python interface seems straightforward. Installation on my Linux seems to work.

BoPeng commented 6 years ago

leveldb does not support multiprocessing, so it is ruled out automatically.

In the end sqlite3 is the most mature option, and I am getting an implementation that creates signatures of 180k files in 283s, very close to file-based signatures but with a single ~/.sos/signatures.db file. Validating the same 180k files took 22s because validation now checks mtime and size before calculating md5. I have done some stress testing (reads/writes from multiple processes) and it appears that the sqlite solution is quite robust to multiprocessing.

Not sure about sqlite4; the Python binding you sent me is for lsm-db, not sqlite4.

BoPeng commented 6 years ago

https://sqlite.org/src4/artifact/56683d66cbd41c2e

All development work on SQLite4 has ended. The experiment has concluded.

Lessons learned from SQLite4 have been folded into SQLite3 which continues to be actively maintained and developed. This repository exists as an historical record. There are no plans at this time to resume development of SQLite4.

gaow commented 6 years ago

Validating the same 180k files took 22s because validation now checks mtime and size before calculating md5.

This is smart! What is the mtime check exactly? I thought you initially disliked timestamps because you think it is an important feature that SoS tasks are workflow-independent.

All development work on SQLite4 has ended. The experiment has concluded.

Yes ... they should have updated the sqlite4 overview page with this info! It seems the lsm1 extension is how sqlite4 is now folded into sqlite3.

BoPeng commented 6 years ago

mtime is the last modified time of the file. The new logic is here.
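Roughly, the order of checks is: compare the stored mtime and size against the file first, and only fall back to computing md5 when those cheap checks are inconclusive. A sketch of that idea (not the actual linked code):

```python
import hashlib
import os

def validate(path, saved_mtime, saved_size, saved_md5):
    if not os.path.isfile(path):
        return False
    st = os.stat(path)
    if st.st_mtime == saved_mtime and st.st_size == saved_size:
        return True                    # cheap checks pass, skip md5 entirely
    if st.st_size != saved_size:
        return False                   # size differs, the file has changed
    md5 = hashlib.md5()                # same size but new mtime: verify content
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            md5.update(chunk)
    return md5.hexdigest() == saved_md5
```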

gaow commented 6 years ago

Okay, I see the logic now. I like this change!

Are you worried about the "sqlite database disk image is malformed" issue? It used to happen from time to time in the vtools project, although I recall the solution was almost always a single dump command, that is, nothing was really corrupted. Hopefully sqlite is a lot more robust now after all these years?

BoPeng commented 6 years ago

It was almost always user error, and the "transactions" here are much smaller than in vtools, so there is far less chance of a traffic jam. I tested using 10 processes to process 180k files and there was no problem.
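For what it is worth, a stress test along those lines can be sketched like this (hypothetical paths and table layout; each process opens its own connection and relies on sqlite's locking with a generous timeout):

```python
import os
import sqlite3
from multiprocessing import Pool

DB = 'signatures.db'

def write_sigs(paths):
    # each worker opens its own connection; the timeout lets sqlite wait
    # for the write lock instead of raising "database is locked" right away
    conn = sqlite3.connect(DB, timeout=60)
    for path in paths:
        st = os.stat(path)
        conn.execute('INSERT OR REPLACE INTO signature VALUES (?, ?, ?, ?)',
                     (path, st.st_mtime, st.st_size, None))  # md5 omitted here
    conn.commit()
    conn.close()

if __name__ == '__main__':
    conn = sqlite3.connect(DB)
    conn.execute('CREATE TABLE IF NOT EXISTS signature '
                 '(path TEXT PRIMARY KEY, mtime REAL, size INTEGER, md5 TEXT)')
    conn.commit()
    conn.close()
    files = [os.path.join(r, f) for r, _, fs in os.walk('.') for f in fs]
    with Pool(10) as pool:                    # 10 concurrent writer processes
        pool.map(write_sigs, [files[i::10] for i in range(10)])
```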

BoPeng commented 6 years ago

Workflow signatures (.sig) are also stored in the sqlite database because they also have a relatively small size. The remaining signature type is the step signature (.exe_info), which is the one you are concerned about because it can contain large input variables.

gaow commented 6 years ago

Great, I just tried it out locally and it was flawless on my test data. I cannot wait to apply this because we will soon hit the file count quota again, but I still have ongoing jobs on the cluster. I shall try it later (or should I wait a bit for more improvements?).

BTW, here is the file count limitation for the GPFS partition where I run my projects:

fileset          type                   used      quota      limit    grace
                  files  (group)      1009794    1035800    1139380     none
BoPeng commented 6 years ago

I will work on step signatures today.

gaow commented 6 years ago

Cool! I'm wondering if they should be in the same db file or in separate ones? From yesterday's work alone:

$  ls .runtime | wc -l
55728
$ du .runtime -h --max-depth 1
463M    .runtime

That's why SoS is (for now) pretty hostile to systems with a file count quota :P

BoPeng commented 6 years ago

Signature files could be combined, but lock files cannot. Citing http://lists.openstack.org/pipermail/openstack-dev/2015-November/080834.html,

Second, is this actually a problem? Modern filesystems have absurdly large limits on the number of files in a directory, so it's highly unlikely we would ever exhaust that, and we're creating all zero byte files so there shouldn't be a significant space impact either. In the past I believe our recommendation has been to simply create a cleanup job that runs on boot, before any of the OpenStack services start, that deletes all of the lock files. At that point you know it's safe to delete them, and it prevents your lock file directory from growing forever.

gaow commented 6 years ago

Okay, but are lock files temporary (removed after execution), and is only a limited number of lock files generated during execution compared to the number of workflow / step signatures? There is no problem leaving behind some files if there are not too many, at least not enough to make SoS stand out (though the person who initially complained to me about file counts did so because the .sos folder got added and pushed to GitHub).

Seems like the cluster here is not that modern :) It is, however, the best cluster system I've worked with so far in terms of capacity, robustness, and support.

BoPeng commented 6 years ago

No. The lock files will accumulate if not cleaned up. I am trying to remove lock files after they are unlocked, but this might lead to obscure bugs later (another process might be trying to lock the file at the same time, etc.). I will keep this in mind and look for better alternatives.

gaow commented 6 years ago

but this might lead to obscure bugs later

That sounds worrisome. What if lock files were saved in a tempdir that removes itself after the entire sos command completes?
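That is, roughly (a minimal illustration of the idea, with hypothetical names):

```python
import os
import tempfile

# hypothetical: one lock directory per sos run, removed automatically on exit
with tempfile.TemporaryDirectory(prefix='sos_locks_') as lock_dir:
    lock_file = os.path.join(lock_dir, 'step_10_0.lock')
    # ... acquire/release file locks under lock_dir while the workflow runs ...
# the directory and any leftover lock files are deleted here
```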

BoPeng commented 6 years ago

This sounds like a good idea....

BoPeng commented 6 years ago

So all the signatures are now in their corresponding sqlite files, and there should be a substantially smaller number of small files. The rest is bug fixing, because the changes are pretty extensive.

gaow commented 6 years ago

Great! I confirm it works for a simple toy example on my desktop:

$ ll .sos/
total 196K
-rw-r--r-- 1 gaow gaow  79K Aug  8 12:41 transcript.txt
-rw-r--r-- 1 gaow gaow  12K Aug  8 12:41 step_signatures.db
-rw-r--r-- 1 gaow gaow 100K Aug  8 12:41 workflow_signatures.db

Using an earlier version of SoS on the same example:

$ du -h .sos/.runtime/
368K    .sos/.runtime/
$ ls .sos/.runtime/ | wc -l
243

Before I test on the cluster with my next batch of task jobs: do I have to update sos-pbs? Is the job status check feature also in place?

BoPeng commented 6 years ago

There is still a transcript.txt file, which basically records what the script-running actions execute. Does it make sense to include it in the workflow signatures and report it in the workflow report?

gaow commented 6 years ago

There is still a transcript.txt file

I prefer to have it where it is. It is just one file and is useful for quick debugging now that we have removed intermediate scripts. Sometimes a bug in one script is actually due to a bug in an upstream script. When that happens I find transcript.txt quite handy.

BoPeng commented 6 years ago

This file is not documented and looks like a hidden feature. It is not lock-protected, so there is a slight chance of conflict (but the consequences are not big), and it collects scripts from all workflows without identification.

I mean, it can remain a hidden convenience feature, but we should do more if we are going to advertise it.

BoPeng commented 6 years ago

Anyway, all tests pass, so I will assume that things are working more or less OK. The changes use a new set of files, so there should be no problem with backward compatibility. Please feel free to use it and let me know if you notice any problems.

gaow commented 6 years ago

Wonderful. I guess I'll start by rebuilding signatures for my existing run (and possibly other earlier runs), cleaning up the .sos folders I can find on the cluster, and checking that the file counts are reduced. I will report issues if I run into any at that stage.

gaow commented 6 years ago

@BoPeng Does Python ship with sqlite3 or does it have to be installed? If the latter, perhaps the dependency should be included in setup.py?

BoPeng commented 6 years ago

sqlite3 is a builtin module.

gaow commented 6 years ago

A few comments:

  1. I notice SoS still tries to create ~/.sos/task. Is that necessary?

  2. I just rebuilt signatures for my current project. The ~/.sos/task folder used to have 4984 files with a total size of 578MB. Now it is a single 9MB target_signatures.db file. The .sos folder used to contain > 50K signature / lock files with > 500MB total size, but now it is 19MB for the workflow signature db and 17MB for the step signature db. It is quite an improvement! I will try running other projects from scratch and see what happens.

  3. Constructing these signatures ("ignored with signatures constructed") took 2882 seconds. It looks like within each step the concurrent signatures are still reconstructed sequentially, at about 5 records per second. It is not done in parallel (?).

  4. Is there a concern / solution RE the possibility that the sqlite database gets too big over time?

BoPeng commented 6 years ago
  1. You meant ~/.sos/tasks, yes it is still needed.
  2. Good to know.
  3. The substeps are checked before they are submitted; I can have a look, though.
  4. Yes, but I assume ./.sos will be removed with the removal of the project, and ~/.sos will contain only global signatures, which are small in number.
BoPeng commented 6 years ago

3 is possible, but anything related to signature locking would be troublesome... I will leave this for later.

gaow commented 6 years ago

You meant ~/.sos/tasks, yes it is still needed.

Looks like it will be emptied after use?

~/.sos will contain only global signatures, which are small in number.

Let's hope it does not grow too big. My 5K-task DB is 9MB. Is it true that only completed tasks have their signatures stored? Or will failed tasks be replaced by the completed status when they eventually complete? Unlike on a file system, removing entries from SQLite will not automatically shrink its size accordingly.

3 is possible, but anything related to signature locking would be troublesome... I will leave this for later.

Sure. I guess we can close this ticket if it seems to work for now. I can use a separate ticket for it.

gaow commented 6 years ago

So far so good. I've tried more job submissions, including recovering from failed jobs. I have also run some local non-task workflows. It seems to work flawlessly for me.

Shall we release a new version to get others (from my end) to test, and fix potential problems from there?