Yes, a proper solution to this problem would be very helpful for large-scale analysis. But before diving into details: 1) how does nextflow cope with it? 2) a database solution would help with the number of files generated, but not with the size of signatures (for large global variables, which we may worry about later); and 3) if we have a single centralized database it can get large over time in terms of file size and number of rows -- I guess this is why you said earlier that you believe the filesystem is the most efficient database?
I can appreciate the challenges of using databases, as you've listed. Another high-level concern: even if we can resolve all the problems one way or another, how robust and efficient do you think it is going to be?
A minor comment on current status:
because most users simply wrap up the project without performing sos remove -s or rm -rf .sos.
Even when they do, given that the files are stored in ~/.sos, we cannot easily rm -rf the correct subset. tag is useful, but I've not used sos purge with tags (and am not sure how it compares with rm -f in terms of performance) ... we did not have it documented.
I think with the new_format branch, as long as we can easily clean up after ourselves (or do it somewhat automatically), we should be in a much better situation than before.
sos remove -s is supposed to remove project-related signatures, but it might miss some files if the step signatures are missing. Overall, sos remove -s is not an efficient approach.
Just for testing, I am writing signature files for all files in my anaconda directory. On a Mac Pro with an SSD drive:
$ python test.py
1000 1.2
2000 2.5
3000 3.8
4000 5.1
5000 6.5
6000 7.9
...
164000 252.6
165000 254.2
166000 255.9
167000 257.7
168000 263.7
169000 273.9
170000 291.2
This is about 1.3 seconds per 1000 files at the start, and about 1.7 seconds per 1000 files by the time there are 168k files; an occasional batch of 1000 takes 10 seconds or more, perhaps because of file size. Overall, 178k files took 358.6 seconds (about 6 minutes).
On a cluster with NFS storage, I am seeing:
$ python test.py
1000 7.0
2000 13.0
3000 18.7
4000 24.7
5000 29.5
6000 36.3
...
28000 182.6
29000 188.5
Traceback (most recent call last):
File "test.py", line 10, in <module>
file_target(os.path.join(root, file)).write_sig()
File "/scratch/bcb/bpeng1/anaconda3/lib/python3.6/site-packages/sos/targets.py", line 634, in write_sig
f'{self.fullname()}\t{os.path.getmtime(self)}\t{os.path.getsize(self)}\t{self.target_signature()}\n')
File "/scratch/bcb/bpeng1/anaconda3/lib/python3.6/genericpath.py", line 55, in getmtime
return os.stat(filename).st_mtime
FileNotFoundError: [Errno 2] No such file or directory: '/scratch/bcb/bpeng1/anaconda3/pkgs/matplotlib-2.1.0-py36hba5de38_0/lib/libtk.so'
It crashes at about 29k files, but the problem is with reading the file being signed (it no longer exists), not with writing the signature file.
I assume file size also matters to the benchmark? 29K files in 3 minutes is acceptable -- does it mean that for an analysis involving all 30K genes it only takes 3 minutes to check signatures and determine whether to rerun?
This is only for saving signatures; I have not checked the performance of verifying them. Anaconda files are generally small, so the calculation of md5 should be fast.
I see. I guess what matters to the user experience is the signature check / skip when they rerun: suppose out of 30K jobs only 100 failed. In order to resubmit those 100, it is acceptable to wait around 5 minutes for SoS to figure out what to resubmit. Saving a signature at the end of a running task is trivial compared to the computation itself.
How do we handle large global variables? We save their md5sum, right?
I'm currently halfway through running a cluster job. It now has 1067 .task files in the ~/.sos folder, with a total size of 32M. So 100K files would be about 3.2G, which perhaps is not too bad.
Let us do some benchmarking first. We could check only file size and mtime when verifying a signature, and use md5 only when the file is accessed. We do not know, however, how fast it is to retrieve mtime and size from the signature file.
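To make the idea concrete, here is a minimal sketch of such a fast-path check (the layout of the saved record is an assumption for illustration, not the actual SoS signature format):

```python
import hashlib
import os

def file_md5(path, chunk_size=1 << 20):
    """Compute the md5 checksum of a file in chunks."""
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            md5.update(chunk)
    return md5.hexdigest()

def signature_matches(path, saved):
    """Check a target against a saved record {'mtime': ..., 'size': ..., 'md5': ...}.

    mtime and size only need a stat() call, so they are checked first;
    the expensive md5 is computed only when they disagree.
    """
    st = os.stat(path)
    if st.st_mtime == saved['mtime'] and st.st_size == saved['size']:
        return True                          # cheap check passed, skip md5
    return file_md5(path) == saved['md5']    # content may still be unchanged
```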
Using a very dumb sqlite method that takes a full process lock for each write operation, and each time opens the database, inserts or replaces the signature, commits, and closes:
$ python test.py
1000 7.2
2000 11.9
3000 16.8
4000 21.7
5000 26.4
6000 30.6
7000 34.7
8000 39.0
9000 43.5
10000 48.1
...
173000 688.0
174000 691.5
175000 695.2
176000 698.8
177000 702.6
178000 706.2
Writing 178819 signatures took 709.2 seconds
So overall it is only about 2 times slower for 178k files. This is on an SSD with supposedly fast random access. The database size is 43M.
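For reference, a rough sketch of the "dumb" per-write pattern benchmarked above, assuming a simple key-value style table (the table, column names, and path are made up for illustration):

```python
import sqlite3

DB = '.sos/signatures.db'   # illustrative path

def write_signature_naive(path, mtime, size, md5):
    """One complete open / insert-or-replace / commit / close cycle per file.

    The benchmarked version also took a process-wide lock around each call;
    that extra locking is omitted here for brevity.
    """
    conn = sqlite3.connect(DB)
    conn.execute('CREATE TABLE IF NOT EXISTS targets '
                 '(path TEXT PRIMARY KEY, mtime REAL, size INTEGER, md5 TEXT)')
    conn.execute('INSERT OR REPLACE INTO targets VALUES (?, ?, ?, ?)',
                 (path, mtime, size, md5))
    conn.commit()
    conn.close()
```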
On NFS I have (still running):
$ python test.py
1000 16.1
2000 31.5
3000 46.3
4000 61.7
5000 75.8
6000 92.2
7000 110.8
8000 127.5
9000 143.3
...
24000 380.5
25000 397.4
26000 413.7
27000 429.2
28000 445.1
29000 460.2
It also crashes at that same file, so the file might be strange (a link, etc.). The performance is 2 times worse than the file-based version, and 4 times worse than on SSD.
The task signatures are currently pickled and compressed in the .task file, so it is a lot smaller than before, especially for large variables.
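The exact container format of the .task file is not shown in this thread; as a generic sketch, pickling and compressing a signature dictionary could look like the following (zlib is an assumption here, SoS may use a different compressor):

```python
import pickle
import zlib

def dump_task_signature(sig: dict) -> bytes:
    """Serialize and compress a signature dictionary for storage in a .task file."""
    return zlib.compress(pickle.dumps(sig, protocol=pickle.HIGHEST_PROTOCOL))

def load_task_signature(blob: bytes) -> dict:
    """Inverse of dump_task_signature."""
    return pickle.loads(zlib.decompress(blob))
```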
so it is a lot smaller than before especially for large variables.
Sometimes a variable can be really large, e.g. a global variable holding a large data matrix. It would help if we only kept a partial md5sum of it.
So overall it is only about 2 times slower for 178k files.
Not bad! But this is on SSD and our previous experience suggests much slower performance on network file systems.
The query for a task signature is really simple, right? Basically just a dictionary-like key-value lookup? Are there more efficient database implementations for this specific need?
Sometimes a variable can be really large, e.g. a global variable holding a large data matrix. It would help if we only kept a partial md5sum of it.
We do not do md5 on variables. Whereas we could keep only a signature of input variables, output variables are returned for skipped steps, so all of their information has to be saved.
Whereas we could keep only a signature of input variables
I think a signature of large input variables makes sense. I'm worried about those big matrices in my global section.
RE database -- a quick search shows this project:
and some benchmarks: http://www.lmdb.tech/bench/microbench/benchmark.html
and a derivative: https://github.com/facebook/rocksdb
Also seemingly relevant is LMDB (https://lmdb.readthedocs.io/en/release). Some argue it is better than LevelDB, but at least its compression is worse: https://www.influxdata.com/blog/benchmarking-leveldb-vs-rocksdb-vs-hyperleveldb-vs-lmdb-performance-for-influxdb/
OK, if I do not open/close the database for each write:
$ python test.py
1000 0.3
2000 0.7
3000 1.0
...
176000 109.0
177000 109.6
178000 110.0
Writing 178819 signatures took 110.2 seconds
Without calculation of md5 (only saving date and size):
177000 18.2
178000 18.2
Writing 178819 signatures took 18.3 seconds
NFS:
$ python test.py
1000 2.7
2000 4.7
3000 6.8
...
20000 59.2
21000 63.6
22000 69.2
23000 74.1
24000 78.3
25000 83.5
26000 88.5
27000 92.8
28000 97.0
29000 100.7
Without md5:
15000 9.0
16000 10.0
17000 10.9
18000 11.7
19000 12.5
20000 13.4
21000 14.3
22000 15.7
23000 16.8
24000 17.5
25000 18.6
26000 19.6
27000 20.4
28000 21.2
29000 21.8
My reservation with leveldb is that it requires an extra dependency, and my attempt to install plyvel just failed on mac.
So I think that, with some careful implementation, the sqlite version should work at least as fast as the file-based version. By careful implementation I mean not opening/closing the database for each signature check, which basically means that database access has to be carefully coordinated among the workers.
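A rough sketch of what such a "careful implementation" could look like: one long-lived connection per process, writes serialized through a lock, and commits batched instead of issued per signature (class and table names are illustrative only, not the actual SoS code):

```python
import sqlite3
import threading

class SignatureDB:
    """Keep a single connection open and batch commits.

    The threading.Lock only serializes threads within one process; across
    processes we still rely on sqlite's own file locking to coordinate writers.
    """

    def __init__(self, path='.sos/signatures.db', batch=1000):
        self.conn = sqlite3.connect(path, timeout=60, check_same_thread=False)
        self.conn.execute('CREATE TABLE IF NOT EXISTS targets '
                          '(path TEXT PRIMARY KEY, mtime REAL, size INTEGER, md5 TEXT)')
        self.lock = threading.Lock()
        self.batch = batch
        self.pending = 0

    def write(self, path, mtime, size, md5):
        with self.lock:
            self.conn.execute('INSERT OR REPLACE INTO targets VALUES (?, ?, ?, ?)',
                              (path, mtime, size, md5))
            self.pending += 1
            if self.pending >= self.batch:   # commit in batches, not per signature
                self.conn.commit()
                self.pending = 0

    def close(self):
        with self.lock:
            self.conn.commit()
            self.conn.close()
```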
my attempt to install plyvel just failed on mac.
This is too bad. My Linux installation worked instantly ...
My concern with sqlite is still its size. As far as I can recall, we had problems in vtools when the file got large. Deleting old tasks will not shrink the size of a sqlite file. I am not sure about leveldb, but at least it uses some compression. On the other hand, maybe we've already compressed enough that additional compression will not matter?
So I think that, with some careful implementation, the sqlite version should work at least as fast as the file-based version.
Looks like it is faster than the file-based version now?
We are still talking strictly about file signatures with a fixed structure. If 160k files take 50M, I do not see a problem with file size.
If 160k files take 50M, I do not see a problem with file size.
I'll see if that is the case without replacing my large global variable with its md5sum.
Okay, so far I've got 4759 task files in ~/.sos/tasks. The total size is 258M. This is because the large global parameter is saved in every step as-is, not as an md5sum. Should I open a separate ticket to discuss using md5sum for non-atomic variables?
Yes.
I did not get a chance earlier today to read more on sqlite4; I have now read it more carefully. It claims in its documentation that it is faster than LevelDB. The Python interface seems straightforward, and installation on my Linux machine seems to work.
leveldb does not support multiprocessing, so it is ruled out automatically.
In the end, sqlite3 is the most mature option, and I am getting an implementation that creates signatures for 180k files in 283s, very close to file-based signatures but with a single ~/.sos/signatures.db file. Validating the same 180k files took 22s because validation now checks mtime and size before calculating md5. I have done some stress testing (read/write from multiple processes) and the sqlite solution appears to be quite robust to multiprocessing.
I am not sure about sqlite4; the Python binding you sent me is for lsm-db, not sqlite4.
https://sqlite.org/src4/artifact/56683d66cbd41c2e
All development work on SQLite4 has ended. The experiment has concluded.
Lessons learned from SQLite4 have been folded into SQLite3 which continues to be actively maintained and developed. This repository exists as an historical record. There are no plans at this time to resume development of SQLite4.
Validating the same 180k files took 22s because validation now checks mtime and size before calculating md5.
This is smart! What is the mtime check exactly? I thought you initially disliked timestamps because you think it is an important feature that SoS tasks are workflow-independent.
All development work on SQLite4 has ended. The experiment has concluded.
Yes ... they should have updated the sqlite4 overview page with this info! It seems the lsm1 extension is how SQLite4 is now folded into SQLite3.
Okay, I see the logic now. I like this change!
Are you worried about the "sqlite database disk image is malformed" issue? It used to happen from time to time in the vtools project, although I recall the fix was almost always a simple one-line dump command, i.e., nothing was really corrupted. Hopefully sqlite is a lot more robust now after all these years?
It was almost always user error, and the transactions here are much smaller than in vtools, so there is far less chance of a traffic jam. I tested using 10 processes to process 180k files and there was no problem.
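A sketch of that kind of stress test, with each worker process opening its own connection and relying on sqlite's locking (the path, table, and counts are arbitrary, not the actual test that was run):

```python
import sqlite3
from multiprocessing import Pool

DB = '/tmp/sig_stress_test.db'   # throwaway database for the test

def write_one(i):
    # each worker uses its own connection; a long timeout makes it wait for locks
    conn = sqlite3.connect(DB, timeout=60)
    conn.execute('INSERT OR REPLACE INTO targets VALUES (?, ?)', (f'file_{i}', 'dummy-md5'))
    conn.commit()
    conn.close()

if __name__ == '__main__':
    conn = sqlite3.connect(DB)
    conn.execute('CREATE TABLE IF NOT EXISTS targets (path TEXT PRIMARY KEY, md5 TEXT)')
    conn.commit()
    conn.close()
    with Pool(10) as pool:              # 10 concurrent writer processes
        pool.map(write_one, range(10000))
    conn = sqlite3.connect(DB)
    print(conn.execute('SELECT COUNT(*) FROM targets').fetchone()[0])   # expect 10000
```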
Workflow signatures (.sig) are also stored in a sqlite database because they are relatively small as well. The remaining signature type is the step signature (.exe_info), which is the one you are concerned about because it can contain large input variables.
Great, I just tried it out locally and it was flawless on my test data. I cannot wait to apply this because we are about to hit the file count quota again, but I still have ongoing jobs on the cluster. I shall try it later (or should I wait a bit for more improvements?).
BTW, here is the file count limitation for the GPFS partition where I run my projects:
fileset type used quota limit grace
files (group) 1009794 1035800 1139380 none
I will work on step signatures today.
Cool! I'm wondering whether they should be in the same db file or in separate ones? From my work yesterday alone:
$ ls .runtime | wc -l
55728
$ du .runtime -h --max-depth 1
463M .runtime
That's why SoS is (for now) pretty hostile to systems with a file count quota :P
Signature files could be combined, but lock files cannot. Citing http://lists.openstack.org/pipermail/openstack-dev/2015-November/080834.html,
Second, is this actually a problem? Modern filesystems have absurdly large limits on the number of files in a directory, so it's highly unlikely we would ever exhaust that, and we're creating all zero byte files so there shouldn't be a significant space impact either. In the past I believe our recommendation has been to simply create a cleanup job that runs on boot, before any of the OpenStack services start, that deletes all of the lock files. At that point you know it's safe to delete them, and it prevents your lock file directory from growing forever.
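In the same spirit, a cleanup job run at a safe time (e.g. at boot, or before starting any workflow) could be as small as the following; the directory and the .lock suffix are assumptions for illustration:

```python
from pathlib import Path

def clean_lock_files(runtime_dir='.sos/.runtime'):
    """Delete leftover lock files; only safe when no sos process is running."""
    removed = 0
    for lock in Path(runtime_dir).glob('*.lock'):
        lock.unlink()
        removed += 1
    return removed
```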
Okay, but are lock files temporary (removed after execution), and is only a limited number of lock files generated during execution compared with the number of workflow / step signatures? There is no problem leaving behind some files if there are not too many, at least not enough to make SoS stand out (though the person who initially complained to me about file counts did so because the .sos folder got added and pushed to GitHub).
It seems the cluster here is not that modern :) It is, however, the best cluster system I've worked with so far in terms of capacity, robustness, and support.
No. The lock files will accumulate if not cleaned up. I am trying to remove lock files after they are unlocked, but this might lead to obscure bugs later (another process might be trying to lock the same file at the same time, etc.). I will keep this in mind and look for better alternatives.
but this might lead to obscure bugs later
It sounds worrisome. What if lock files are saved in a tempdir that removes itself after the entire sos command completes?
This sounds like a good idea....
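A minimal sketch of that idea, with a per-run directory created when the sos command starts and removed at exit (names are hypothetical; child worker processes would need to be handed this path, e.g. through an environment variable):

```python
import atexit
import os
import shutil
import tempfile

# created once when the sos command starts ...
LOCK_DIR = tempfile.mkdtemp(prefix='sos_locks_')
# ... and removed, together with any leftover lock files, when the command exits
atexit.register(shutil.rmtree, LOCK_DIR, ignore_errors=True)

def lock_path(signature_id: str) -> str:
    """Place lock files in the per-run temporary directory instead of ~/.sos."""
    return os.path.join(LOCK_DIR, signature_id + '.lock')
```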
So all the signatures are now in their corresponding sqlite files, and there should be a substantially smaller number of small files. The rest is bug fixing, because the changes are pretty extensive.
Great! I confirm it works for a simple toy example on my desktop:
$ ll .sos/
total 196K
-rw-r--r-- 1 gaow gaow 79K Aug 8 12:41 transcript.txt
-rw-r--r-- 1 gaow gaow 12K Aug 8 12:41 step_signatures.db
-rw-r--r-- 1 gaow gaow 100K Aug 8 12:41 workflow_signatures.db
Using an earlier version of SoS on the same example:
$ ll .sos/
$ du -h .sos/.runtime/
368K .sos/.runtime/
$ ls .sos/.runtime/ | wc -l
243
Before I test on the cluster with my next batch of task jobs: do I have to update sos-pbs? Is the job status check feature also in place?
There is still a transcript.txt file, which basically records what the script-running actions execute. Does it make sense to include it in workflow signatures and report it in the workflow report?
There is still a transcript.txt file
I prefer to have it where it is. It is just one file and is useful for quick debugging now that we have removed intermediate scripts. Sometimes the bug in one script is actually due to a bug in an upstream script. When that happens, I find transcript.txt quite handy.
This file is not documented and looks like a hidden feature. It is not lock-protected, so there is a slight chance of conflict (though the consequences are not big), and it collects scripts from all workflows without identification.
I mean, it can remain a hidden convenience feature, but we should do more if we are going to advertise it.
Anyway, all tests pass, so I will assume that things are working more or less OK. The changes use a new set of files, so there should be no problem with backward compatibility. Please feel free to use it and let me know if you notice any problems.
Wonderful. I guess I'll start by rebuilding signatures for my existing run (and possibly other earlier runs) to clean up the .sos folders I can find on the cluster and see the file counts drop. I will report issues if I run into any at that stage.
@BoPeng Does Python ship with sqlite3, or does it have to be installed? If it is the latter, perhaps the dependency should be included in setup.py?
sqlite3 is a builtin module.
A few comments:
1. I notice SoS still tries to create ~/.sos/task. Is that necessary?
2. I just rebuilt signatures for my current project. The ~/.sos/task folder used to have 4984 files with a total size of 578MB; now it is a single 9MB target_signatures.db file. The .sos folder used to contain more than 50K signature / lock files totaling over 500MB, but now it is 19MB for the workflow signature db and 17MB for the step signature db. It is quite an improvement! I will try running other projects from scratch and see what happens.
3. Constructing these signatures ("ignored with signatures constructed") took 2882 seconds. It looks like within each step the concurrent signatures are still constructed sequentially, at about 5 records per second; it is not done in parallel (?).
4. Is there a concern / solution regarding the possibility that the sqlite database gets too big over time?
You meant ~/.sos/tasks, yes it is still needed.
./.sos will be removed with the removal of the project, and ~/.sos will contain only global signatures, which are small in quantity.
3 is possible, but anything related to signature locking would be troublesome; I will leave this for later.
You meant ~/.sos/tasks, yes it is still needed.
Looks like it will be emptied after use?
~/.sos will contain only global signatures, which are small in quantity.
Let's hope it does not grow too big. My 5K-task DB is 9MB. Is it true that only completed tasks have their signatures stored? Or will failed tasks be replaced with a completed status once they complete? Unlike on a file system, removing entries from SQLite will not automatically shrink the file size.
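If the file does grow after deletions, SQLite can reclaim the space with a VACUUM; a periodic maintenance step (just an option, not something SoS does in this thread) would be:

```python
import sqlite3

def compact(db_path):
    """Rebuild the database file so space freed by deleted rows is returned to the OS."""
    conn = sqlite3.connect(db_path)
    conn.execute('VACUUM')
    conn.close()
```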
3 is possible, but anything related to signature locking would be troublesome; I will leave this for later.
Sure. I guess we can close this ticket if it seems to work for now. I can open a separate ticket for that.
So far so good. I've tried more job submissions, including recovering from failed jobs. I have also run some local non-task workflows. It seems to work flawlessly for me.
Shall we release a new version to get others (from my end) to test, and fix potential problems from there?
Currently we basically create a signature file for each input and output file. Most of them are under .sos/.runtime, but some global files have signatures under ~/.sos/.runtime. The signatures contain the file size, last modified date, and md5 checksum, and are used to track whether a file has changed. The signature files are shared by workflows, in the sense that different workflows have access to the signatures of the same files. The signatures can be removed by the command sos remove -s.
The problem with this approach is that the small signature files accumulate over time, because most users simply wrap up the project without performing sos remove -s or rm -rf .sos. Also, for projects that deal with a lot of input and output files, the time to handle signatures can exceed the time to process the files. It has been suggested that we could use timestamps instead of checksums, but the problem of saving signatures remains.
It is possible to use a database approach to properly organize the signatures. However, there can be multiple sos instances, and sos itself runs in a multi-processing mode. Frequent locking and unlocking to safely read/write such a file-based database (e.g. sqlite) can be troublesome. There are several solutions:
Note that we did not consider a server-based approach because our tasks use file-based signatures, but they can be executed on remote hosts that might not be able to communicate with the (redis) server. However, with the new in-memory signature specifically designed for tasks (in the new_format branch), we are in a much better position to handle the signature problems, and the remaining issues are mostly technical. That is to say, whether there is a need to transfer signatures (e.g. with the sos pack command), and how to do it if there is such a need.
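To make the current file-per-target approach concrete, here is a rough sketch of writing one such signature file; the directory layout and file naming are illustrative only, and the tab-separated record mirrors the format visible in the traceback above (path, mtime, size, md5):

```python
import hashlib
import os

def write_file_signature(target, sig_dir='.sos/.runtime'):
    """Write a one-line signature (path, mtime, size, md5) for `target`
    into its own small file under the runtime directory."""
    md5 = hashlib.md5()
    with open(target, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            md5.update(chunk)
    os.makedirs(sig_dir, exist_ok=True)
    # derive the signature file name from the target path (naming scheme assumed)
    name = hashlib.md5(os.path.abspath(target).encode()).hexdigest() + '.file_info'
    with open(os.path.join(sig_dir, name), 'w') as sig:
        sig.write(f'{os.path.abspath(target)}\t{os.path.getmtime(target)}\t'
                  f'{os.path.getsize(target)}\t{md5.hexdigest()}\n')
```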