Accessing the test suite SBML models without cloning repository (sbmlteam/sbml-test-suite #76)

anandijain commented 3 years ago

This repository is ~400 MB including the git tree.

I am wondering if the developers have a way to access the cases without cloning the repo? Like an API?

fbergmann commented 3 years ago

We do have releases of the test suite available, where you can download a zip file of all the cases, but they probably would not include the newer ones. We also have an online application where you can select which test cases you'd like and download that archive:

http://raterule.caltech.edu/Facilities/Database/Home/Download

anandijain commented 3 years ago

Is there a programmatic way of getting all of them (including new)?

I am looking to use them as part of our testing infrastructure in https://github.com/anandijain/SBML.jl/. See https://github.com/LCSB-BioCore/SBML.jl/issues/31 for more discussion.

fbergmann commented 3 years ago

I would probably just grab the zip from GitHub for that; that way you get all of them:

https://github.com/sbmlteam/sbml-test-suite/archive/refs/heads/master.zip

That way you can easily run over all of them.

anandijain commented 3 years ago

It's still 200 MB. The SBML itself is a small fraction of this, maybe 1%, and the zip still contains all of the stuff I don't want. I just need cases/. Having only that would make testing against the entire suite feasible in continuous integration (GitHub Actions).
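For reference, a minimal Python sketch of keeping only cases/ from the archive zip. It assumes the GitHub archive wraps everything in a single top-level folder (such as sbml-test-suite-master/); the function name `extract_subdir` is hypothetical. It only reduces what lands on disk: the full zip still has to be downloaded first.

```python
import io
import zipfile
from pathlib import Path


def extract_subdir(zip_bytes, subdir, dest):
    """Extract only the members under <top-level>/<subdir>/ into dest.

    Assumption: the archive has one wrapper folder as its first path
    component, which is dropped from the extracted paths.
    """
    written = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            parts = info.filename.split("/")
            # Skip directories, unsafe paths, and anything outside subdir.
            if info.is_dir() or ".." in parts:
                continue
            if len(parts) < 3 or parts[1] != subdir:
                continue
            relative = "/".join(parts[1:])
            target = Path(dest) / relative
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(archive.read(info))
            written.append(relative)
    return written
```

The bytes could come from `urllib.request.urlopen(url).read()` on the master.zip URL above, followed by `extract_subdir(data, "cases", "suite")`.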

I would be happy to help in any way I can, I think it would make the suite much more accessible.

luciansmith commented 3 years ago

It looks like git (as always) is pretty far behind on this sort of functionality, but happily, GitHub provides svn access to its repositories as well, which has the functionality you need:

svn checkout https://github.com/sbmlteam/sbml-test-suite.git/trunk/cases

anandijain commented 3 years ago

Thanks, this is the best solution so far. The only downside is that it requires an SVN installation.

I'm wondering whether, under the license, I'd be allowed to make my own repo that contains just the .xml files.

mhucka commented 3 years ago

Sorry I'm coming into this a bit late. First, I fear there may be some confusion about what comprises a test case. The cases subdirectory takes up 155 MB uncompressed, and most of that is actually the SBML XML files. It's not "1%". Here's an example semantic test case:

00001-model.m
00001-plot.html
00001-results.csv
00001-sbml-l1v2.xml
00001-sbml-l2v1.xml
00001-sbml-l2v2.xml
00001-sbml-l2v3.xml
00001-sbml-l2v4.xml
00001-sbml-l2v5.xml
00001-sbml-l3v1.xml
00001-sbml-l3v2.xml
00001-settings.txt

Most of the files are necessary for the cases – the settings files and the model definition .m files at the very least. There are 1800+ semantic cases, so it simply adds up. The only way to reduce it would be to skip some level/version combinations, which can be a legitimate goal, of course, but then that's counter to the desire of getting all of them.

Unlike my colleague @luciansmith, I am sure there is a way to get just the cases using git without resorting to svn :-). Looking around, a method to get a whole subdirectory (which could perhaps be used to get all of cases/) can be found at this StackOverflow question, though I haven't had a chance to try it. There also appears to be a technique to get files by extension, based on this StackOverflow question and answers – perhaps that could be used to find all the SBML files for a particular level/version combo, plus (separately) the other files defining the case parameters.
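However the files end up being fetched, the selection itself (one level/version combination plus the files defining the case parameters) can be sketched in Python as below. The function name is hypothetical, and keeping the results .csv alongside the settings and .m files is an assumption based on the example listing above, not something the suite prescribes.

```python
def select_case_files(filenames, level, version):
    """Keep one SBML level/version file plus the files that define the case."""
    keep_suffixes = (
        f"-sbml-l{level}v{version}.xml",  # the chosen SBML flavour
        "-settings.txt",                  # simulation settings
        "-model.m",                       # model definition and tags
        "-results.csv",                   # assumption: expected results are wanted too
    )
    # str.endswith accepts a tuple and matches any of the suffixes.
    return [name for name in filenames if name.endswith(keep_suffixes)]
```

For the case listed above, `select_case_files(names, 3, 2)` would drop the plot and the seven other SBML variants and keep four files.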

The cases-only releases via github that we provided before did contain additional files and are admittedly large archives, even compressed. It's a fair question to ask how to get a minimal set for continuous integration/testing, so I think we should try to find a solution, but the minimal solution is probably going to clock in at 150+ MB when uncompressed. (Compressed, it would be much smaller, of course, especially if it didn't contain images of the plots and such, which is what the previous cases-only archives provided and the reason they're so large.)

@anandijain Is the size constraint coming from wanting to reduce the transmission size, or the on-disk uncompressed size?

luciansmith commented 3 years ago

Ha--the stack overflow page you found was the same one I found that suggested svn ;-) The other options all looked like various third-party software tools (though I might have missed one). The top answer says, "Git doesn't support this, but Github does via SVN," so my guess was that's what the third-party tools use under the hood.

In my hands, just compressing stuff on my own machine, I found that compressing just the folders and files from the test suite gave me a file ~100MB large, and compressing just the 'cases' directory gave me a file about 12MB. So there's definitely a lot of extra stuff in there if you only want the cases.

To your licensing question: the sbml-test-suite license is LGPL, which is fairly permissive. I work on libroadrunner, and for that tool, we've simply copied the test suite into our own source tree under 'tests', and we periodically update that directory when the official version is updated.

I've found it helpful to check both the earliest and the latest level/version of each test with our simulator, so (for example) for a test that has l1v2, l2v1, l2v2, l2v3, l2v4, l2v5, l3v1, and l3v2 files, I'll check the l1v2 and the l3v2 versions. This revealed a handful of obscure bugs in our interpretation of l1 models that we wouldn't have found otherwise. It might be a good idea to go ahead and check all of them. I would probably suggest doing this at least once, just in case (though it's likely to take an absurdly long time to do so).
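The "earliest and latest level/version" choice can be sketched in Python as follows. This is only an illustration, not libroadrunner's actual test code; the function name is hypothetical, and it assumes the file naming shown in the example case earlier in the thread (`NNNNN-sbml-lXvY.xml`).

```python
import re

# Matches names such as 00001-sbml-l2v4.xml and captures level and version.
SBML_NAME = re.compile(r"-sbml-l(\d+)v(\d+)\.xml$")


def earliest_and_latest(filenames):
    """Return the SBML files with the lowest and highest (level, version)."""
    versions = []
    for name in filenames:
        match = SBML_NAME.search(name)
        if match:
            key = (int(match.group(1)), int(match.group(2)))
            versions.append((key, name))
    if not versions:
        return []
    versions.sort()  # sorts by (level, version) first
    first, last = versions[0][1], versions[-1][1]
    # A case with a single SBML file yields that file once.
    return [first] if first == last else [first, last]
```

Comparing numeric (level, version) tuples rather than raw strings keeps the ordering right even if a version number ever reaches two digits.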

In the end, you might find that it works to fork the repository, delete the files you don't need from your copy (leaving the LICENSE etc. files in place, and probably describing what you did in a README), and then point your GitHub Actions at your fork. It may turn out that this isn't actually much faster than having GitHub Actions point at the originals, but it at least gives you more control.

The other thing to consider is that you probably want your own fork anyway, since when the test suite updates, it might break your tools, and while that's good to know, it would be very confusing to have your own software start failing due to an innocuous change, only to find out that the actual reason is that new tests in the suite revealed some new problem that has nothing to do with your latest checkin. At the very least, I'd suggest pointing at a particular git commit/release so that updates don't surprise you.

mhucka commented 3 years ago

Ha, that's funny! OK, I see what you mean with that particular question, and the difficulty of doing this. I must admit I'm surprised. I looked through the other answers to that question and indeed it does seem to be difficult to use git for this.

After trying some of the suggestions people made, the farthest I got was with the approaches involving the GitHub API. In particular, this answer describes using a combination of things to get download URLs for files and then feeding those to curl. The solution shown in that answer doesn't work for nested subdirectories, but it's clear it could be done. Basically, you end up with API URLs like the following, one per file; the JSON returned for each includes a download_url field that you then pass to curl to download the file:

https://api.github.com/repos/sbmlteam/sbml-test-suite/contents/cases/semantic/00001/00001-settings.txt

It wouldn't be hard to write a script to do this. If I had more time I'd do it right now. But in principle, this seems like a viable solution.
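A rough sketch of such a script, in Python rather than curl, might look like the following. It assumes the contents endpoint returns, for a directory, a JSON list whose entries carry `type`, `path` and `download_url` fields; the function names are hypothetical and the `fetch` parameter exists only so the walker can be tried without network access. Two caveats worth checking against GitHub's documentation before relying on it: unauthenticated requests are rate limited, and directory listings from this endpoint are capped (at 1,000 entries, as far as I know), so a directory with 1800+ case folders may need the Git Trees API instead.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com/repos/sbmlteam/sbml-test-suite/contents/"


def fetch_json(url):
    """Fetch one API URL and decode the JSON body (unauthenticated)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def list_download_urls(path, fetch=fetch_json):
    """Recursively collect download_url values for every file under path."""
    urls = []
    for entry in fetch(API_ROOT + path):
        if entry["type"] == "dir":
            # Descend into nested subdirectories, one request per directory.
            urls.extend(list_download_urls(entry["path"], fetch))
        elif entry["type"] == "file" and entry.get("download_url"):
            urls.append(entry["download_url"])
    return urls
```

Calling `list_download_urls("cases/semantic/00001")` would issue a single request; walking all of cases/ issues one request per directory, which is where the rate limit starts to matter.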

Regarding cloning/forking the test suite repo: that's fine and permitted of course, but if people set up standalone copies of the sbml test suite on GitHub, I just want to plead for putting clear notices that the copies are not the original/master, so that people who happen upon it by doing a search in GitHub don't get confused.

anandijain commented 3 years ago

Thank you all, I very much appreciate the help. I think a stripped-down fork that clearly names this repository as the original is probably the best solution. I may also try curl, since generating the URLs is easy.

My knowledge of SBML is pretty surface level. For context, we are looking to lower these models into an ODE using https://github.com/SciML/ModelingToolkit.jl/ .

I hadn't looked at the settings.txt files and I can see how they are used now. Are these a consistent format throughout the suite?

It's not clear to me why the .m files are needed.

@mhucka

luciansmith commented 3 years ago

I'm guessing that you're going to mostly need the 'semantic' tests; the description of their contents can be found here:

https://github.com/sbmlteam/sbml-test-suite/blob/master/cases/semantic/README.md

But briefly: the settings file is indeed consistent throughout the test suite. The .m file tells you what test to perform (a time course vs. a flux balance analysis), and is important for context when you fail a test: it's the file that lists the features ('tags') that the test tests, so you know if you failed because you just don't support that feature, or if you failed due to a bug in your code. Finally, it describes in human-readable terms what's going on and what's special about that particular test, so, again, if you fail, you have some idea of where to start looking for why.
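A minimal sketch of reading a settings file in Python, assuming each line has the form `key: value` and that keys such as `variables`, `amount` and `concentration` hold comma-separated lists. Both the key names and the function name are assumptions here; the README linked above is the authority on the actual format. Values are left as strings so nothing is guessed about numeric types.

```python
def parse_settings(text):
    """Parse `key: value` lines into a dict; list-like keys become lists."""
    list_keys = {"variables", "amount", "concentration"}  # assumed key names
    settings = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key in list_keys:
            # An empty value (e.g. "concentration:") becomes an empty list.
            settings[key] = [item.strip() for item in value.split(",") if item.strip()]
        else:
            settings[key] = value
    return settings
```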

Some other information:

The 'syntactic' tests do not actually live in the 'syntactic' cases directory, but instead are generated from the libsbml source (where we use them to test that code) and are copied over in batches for particular releases. These files are mostly important if you intend to try to write your own SBML parser, or if you intend to not perform SBML validation on your models before attempting to simulate them: these all illustrate various invalid files that your parser/software might have to deal with.

The 'stochastic' tests test stochastic simulation, and, as such, are tested in a very different way from the 'single time course' semantic tests: you re-run them stochastically several times, and test departure from expected means and standard deviations.
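To make "departure from expected means" concrete, here is a hedged Python sketch of one common way to phrase such a check: a z-score of the sample mean over repeated runs. The function name and the default limit of 3 are placeholders, not the suite's official criteria; the acceptance ranges the suite actually uses are defined in its own documentation.

```python
import math
import statistics


def mean_is_plausible(samples, expected_mean, expected_sd, z_limit=3.0):
    """Check the sample mean of repeated runs against an expected mean.

    samples: values of one variable at one time point, one per stochastic run.
    z_limit: hypothetical acceptance threshold, chosen for illustration.
    """
    sample_mean = statistics.fmean(samples)
    if expected_sd == 0:
        # No expected spread: only an exact match can pass.
        return sample_mean == expected_mean
    # Standard error of the mean shrinks with sqrt(number of runs).
    z = math.sqrt(len(samples)) * (sample_mean - expected_mean) / expected_sd
    return abs(z) <= z_limit
```

A matching check on the standard deviations would follow the same pattern with a different statistic.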

anandijain commented 3 years ago

How often are new tests added? I ask to decide if I should write a script that stays up to date.

It looks like the git tree is by far the largest part of this repo. So basically, I'm just going to make my own repo and delete the git tree, keeping the license. For now at least. https://github.com/anandijain/sbml-test-suite

luciansmith commented 3 years ago

Tests are not added very often. There are a handful that are in the 'develop' branch that we still need to release, but there tends to be a year or more between official releases.

luciansmith commented 2 years ago

So, I just created a new release for the test suite, and the release contains a zip file with just the 'semantic' directory:

https://github.com/sbmlteam/sbml-test-suite/releases/tag/3.4.0

I do think that your fork is probably the safest bet, but if you would like to use the zip instead, it's now there!

Closing this, as I think the issue has indeed been addressed. Thank you again!
