openml / OpenML

Open Machine Learning
https://openml.org
BSD 3-Clause "New" or "Revised" License

Test Server Synchronisation #386

Open janvanrijn opened 7 years ago

janvanrijn commented 7 years ago

I have the impression that the Test Server is hopelessly out of sync with the live server. For example, most of the UCI datasets are not available or linked correctly, e.g.: https://test.openml.org/data/download/2/anneal https://test.openml.org/data/download/24/mushroom https://test.openml.org/data/download/42/soybean

This makes large-scale testing of new features pretty much impossible. Would there be a possibility to a) correct this (at least for the main UCI datasets) and b) synchronize at regular intervals, so that, for example, every week we have a guaranteed direct copy of the live server on the test server? (And in this case, I do not mean for backup purposes, but for test purposes.)

I can imagine also @giuseppec and @mfeurer would greatly benefit from this.

joaquinvanschoren commented 7 years ago

Hmm, if I ask the API for dataset 2: https://test.openml.org/api/v1/data/2

It tells me that the dataset is at http://www.openml.org/data/download/1666876/phpFsFYVN

This is correct, but not ideal, since it points to the production server. How exactly do we store this URL? It is stored in the database, right?

If https://test.openml.org/data/download/2/anneal doesn't work, it seems to be a problem with the data controller?


janvanrijn commented 7 years ago

The URL works, however this is a deprecated way to obtain the dataset, for various reasons. One of them is that all the URLs point hardcoded to "http://openml.org", so every server URL change breaks this. (We should talk with the core developers about removing this field soon, but that is not the point of this issue.)

The correct way of obtaining the dataset is via the file_id and the data controller, which brings you to https://test.openml.org/data/download/1666876/phpFsFYVN (note: 1666876 instead of 2). This is what the API connector does, and it fails because this file is not present on the test server.

This is what I want to discuss in this issue. My point is that we need to make sure that the database and the files of the live server are more or less constantly synchronized to the test server, so that we have a reliable test environment.
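
For illustration, the file_id route looks roughly like this from a client's point of view (a sketch only; the oml XML namespace URI and field names are assumed from the v1 API responses):

```python
# Sketch: build the download URL from the file_id the API returns, rather than
# trusting the stored url field, so it always points at the server being queried.
import xml.etree.ElementTree as ET
import requests

SERVER = "https://test.openml.org"           # or https://www.openml.org
OML = {"oml": "http://openml.org/openml"}    # assumed namespace of the v1 XML

def dataset_download_url(dataset_id: int) -> str:
    xml = requests.get(f"{SERVER}/api/v1/data/{dataset_id}").text
    desc = ET.fromstring(xml)
    file_id = desc.findtext("oml:file_id", namespaces=OML)
    name = desc.findtext("oml:name", namespaces=OML)
    # data controller route, as in the examples above: /data/download/{file_id}/{name}
    return f"{SERVER}/data/download/{file_id}/{name}"

print(dataset_download_url(2))
```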


joaquinvanschoren commented 7 years ago

Ok, to be clear: the URL that is deprecated is a field in the database, not the URL mentioned in the dataset description (that should stay). Hence, this is a backend-only issue, and does not affect the client APIs.

janvanrijn commented 7 years ago

Correct


jaksmid commented 7 years ago

Is that completely true? It would be good if the URL were relative and pointed to the test/production site accordingly.

giuseppec commented 7 years ago

Hi, yeah, I am having trouble here. The XML in https://test.openml.org/api/v1/run/4, for example, links to the prediction file https://test.openml.org/data/download/70/weka_generated_predictions9110805785467244853.arff which does not exist. Do I have to replace test with www, i.e., https://www.openml.org/data/download/70/weka_generated_predictions9110805785467244853.arff now, or will you fix this on the test server?

Currently, our unit tests are broken because of this :(. Any suggestions on what I should do to get this working soon?

janvanrijn commented 7 years ago

We should fix the test server. Altogether, it should be an easy fix to sync everything from live to the test server once.

However, it would be my preference to do this periodically (e.g., weekly) because of the amount of development stuff (faulty configurations and so on) that is on there. I assume that @mfeurer has the same opinion on this. What do you think?

joaquinvanschoren commented 7 years ago

So, what happened? Were files deleted on the test server?

In any case, we can sync them back?

We can regularly copy new datasets to the test server, but we can't assume a full sync, right? People may put datasets/flows/... on the test server for testing, and we can't overwrite those.


janvanrijn commented 7 years ago

The latter depends on how we want to use the test server, I think.

From my personal POV, I can see how having a 'reliable' and relatively synced test server is more useful than one that contains a lot of development stuff. For example, during the development of the Python API, many flows/runs got uploaded by means of an underdeveloped function that we later decided was invalid.

But I think it is important to know how the developers of the other clients look at this. @giuseppec @berndbischl

Furthermore, I wouldn't know how much time it would cost to set this up, in particular the weekly database synchronizations.

joaquinvanschoren commented 7 years ago

You can set up the test DB to be a slave of the production DB, but that would make some types of testing impossible. A weekly sync is probably doable, but we will run into hardware issues as the production DB gets larger and larger. Also, some tests will become slower and slower.

@Bernd @Giuseppe @MFeurer ?

What do we need for Giuseppe's current issue? Is this just a missing file?

Cheers, Joaquin
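
For concreteness, the weekly sync under discussion could be little more than a cron-driven dump-and-rsync along these lines (a sketch only; the host names, paths, and database names are placeholders, not the actual OpenML setup):

```python
# Sketch of the proposed weekly sync: dump the production DB into the test DB,
# then mirror the data files. All names below are hypothetical placeholders.
import subprocess

PROD_DB, TEST_DB = "openml_expdb", "openml_expdb_test"     # hypothetical names
PROD_FILES = "prod.openml.org:/var/openml/data/"           # hypothetical paths
TEST_FILES = "/var/openml-test/data/"

def weekly_sync():
    # 1. dump the production database and load it into the test database
    dump = subprocess.run(["mysqldump", PROD_DB], check=True, capture_output=True)
    subprocess.run(["mysql", TEST_DB], input=dump.stdout, check=True)
    # 2. mirror the dataset and run files; rsync only copies what changed
    subprocess.run(["rsync", "-a", "--delete", PROD_FILES, TEST_FILES], check=True)

if __name__ == "__main__":
    weekly_sync()   # intended to be triggered from cron, e.g. once a week
```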


giuseppe commented 7 years ago

I am the wrong "Giuseppe" :) tagging the correct one I guess @giuseppec

giuseppec commented 7 years ago

I think I don't really have a strong opinion regarding a test server that is always in sync with the production server. For me it is just important that stuff I upload to the test server doesn't appear on the production server. Regarding the hardware issue due to the "copy and paste" of the DB: What if you simply "redirect" to the production server and introduce a "prefix" for stuff I upload to the test server, e.g.

1. Simply redirect stuff from the test server to the production server, e.g. https://test.openml.org/api/v1/run/4 always redirects to https://www.openml.org/api/v1/run/4, to avoid wasting hardware resources.
2. Stuff that was uploaded to the test server should always start with a "prefix", e.g. 000; for example, https://test.openml.org/api/v1/run/0001 could be the first run uploaded to the test server and is only available on the test server.

Is something like this possible? While writing this post, I noticed that this might be completely unnecessary. I mean, if I want to use stuff from the production server, I can simply use the production server directly. I just want to use the test server to upload stuff that I probably don't want on the production server.
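
As an aside, option 1 could also be approximated purely on the client side, e.g. by falling back to the production server for read-only files that are missing on test (a sketch, not something the servers currently do):

```python
# Sketch of a client-side fallback in the spirit of option 1: read-only files
# are fetched from the test server first and, if missing there, from production.
import requests

TEST, PROD = "https://test.openml.org", "https://www.openml.org"

def fetch(path: str) -> bytes:
    """Fetch a read-only file, e.g. '/data/download/<file_id>/<name>'."""
    r = requests.get(TEST + path)
    if r.status_code == 404:              # not (yet) on the test server
        r = requests.get(PROD + path)     # fall back to the production server
    r.raise_for_status()
    return r.content
```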

> What do we need for Giuseppe's current issue? Is this just a missing file?

Regarding the current issue: it is not only one missing file but several missing files, I guess.

janvanrijn commented 7 years ago

To be precise, all of them. Except for iris (task 59), which is on every data mining server ;)


giuseppec commented 7 years ago

Ok, then please let me know when this is fixed. I have to submit the revised R paper and need to update the R package before that, and currently I cannot unit test properly.

joaquinvanschoren commented 7 years ago

OK, I checked. The test server does not have enough space for a full copy of the production servers.

I managed to do a full sync of the files, but am using 91% of disk space now.

The disk that holds the test database is at 99%. I can maybe clear up some space, but syncing the databases won't work long term.

@giuseppec You should be able to run all unit tests again

@janvanrijn If you want, we can see whether a DB copy is possible. Also, I don't understand why this one file (or maybe a few) were missing. Everything else seems to be there.

Cheers, Joaquin


giuseppec commented 7 years ago

Hm, almost all listing APIs are failing, e.g. https://test.openml.org/api/v1/json/task/list/limit/50/

A PHP Error was encountered
Severity: Warning
Message: fopen(/home/jvanscho/openmldata/webdata/log/sql.log): failed to open stream: Permission denied
Filename: models/Log.php
Line Number: 35

giuseppec commented 7 years ago

Listing works now, but uploading something (flows, datasets, runs ...) to the test server fails:

A PHP Error was encountered
Severity: Warning
Message: move_uploaded_file(/home/jvanscho/openmldata/webdata/implementation/binary/classif.xgboost...
Filename: models/File.php
Line Number: 20

joaquinvanschoren commented 7 years ago

Sorry, the sync messed up the permissions. I just tested listing and uploading runs, and this works. Can you please try again?

giuseppec commented 7 years ago

1) Listing and uploading seem to work now. 2) However, deleting stuff does not work :D

A Database Error Occurred
Error Number: 1054
Unknown column 'implementation_id' in 'where clause'
SELECT * FROM `evaluation` WHERE `implementation_id` = "12647" LIMIT 1
Filename: models/abstract/Database_read.php
Line Number: 37

3) Furthermore, the website for the test server is strange: although a run exists (see https://test.openml.org/api/v1/run/528271), the website shows a "This is not the run you are looking for" message (see https://test.openml.org/r/528271) and returns:

Severity: Notice

Message: Undefined property: MY_Loader::$run

Filename: r/pre.php

Line Number: 200

Backtrace:

File: /var/www/openml.org/public_html/openml_OS/views/pages/frontend/r/pre.php
Line: 200
Function: _error_handler

File: /var/www/openml.org/public_html/openml_OS/helpers/cms_helper.php
Line: 19
Function: view

File: /var/www/openml.org/public_html/openml_OS/controllers/Frontend.php
Line: 79
Function: loadpage

File: /var/www/openml.org/public_html/index.php
Line: 334
Function: require_once 

4) but tagging runs does not work (maybe because of 3?)

joaquinvanschoren commented 7 years ago
  1. Great
  2. Seems like @janvanrijn already started with simplifying the database :). @giuseppe, you were deleting a flow, right? I found the bug and removed it.
  3. The search engine is down. Probably another permissions issue. Working on it.
  4. Did you tag via the website or API?
joaquinvanschoren commented 7 years ago

3,4. Search engine is running again, tagging works fine.

joaquinvanschoren commented 7 years ago

Remaining issues I see in this thread:

giuseppec commented 7 years ago

I still have the problem with 3., see e.g., https://test.openml.org/r/528271 which shows:

Severity: Notice

Message: Undefined property: MY_Loader::$run

Filename: r/pre.php

Line Number: 200

Backtrace:

File: /var/www/openml.org/public_html/openml_OS/views/pages/frontend/r/pre.php
Line: 200
Function: _error_handler

File: /var/www/openml.org/public_html/openml_OS/helpers/cms_helper.php
Line: 19
Function: view

File: /var/www/openml.org/public_html/openml_OS/controllers/Frontend.php
Line: 79
Function: loadpage

File: /var/www/openml.org/public_html/index.php
Line: 334
Function: require_once 

Therefore, I also still have issues with 4.

joaquinvanschoren commented 7 years ago

Indeed, I forgot to rebuild the search index. I have done that now for that run. It's now indexing all other runs as well, should be done in an hour or so.

giuseppec commented 7 years ago

I will check it when the test server is up again (it seems to be down at the moment).

joaquinvanschoren commented 7 years ago

Ugh, disk got full again during indexing. I have cleared up a lot of space. Should be good now.

giuseppec commented 7 years ago

We are almost done. For new runs, the evaluations are not computed, see e.g. https://test.openml.org/r/544275 (or the list https://test.openml.org/search?type=run, which shows the status "No evaluations yet (or not applicable)").

janvanrijn commented 7 years ago

XML descriptions on the test server point to a URL on the live server: https://test.openml.org/api/v1/data/2

We should remove the field from the database and handcraft it in the XML, depending on the location (some problems exist here).

> Everything else seems to be there.

Not really. I am currently testing the evaluation engine (updated version) on the test server, and most of the run files seem to be absent.

How can we fix this, i.e., make sure that the runs that are registered also have the files attached? The way it is right now, there is no way of reliably testing my current update (which is kind of huge).

janvanrijn commented 7 years ago

Update to everyone: I just fixed the issue with the data url field.

FYI, every file that is uploaded to the OpenML servers also gets a file_id. In that regard, the url field is duplicate information. Currently, its only use is for datasets that are not stored on the OpenML servers.

When the file_id is present, the server now automatically generates the URL to download from. This way, it will always point to the correct server.

As a consequence, if the file is absent on the (test) server, it will give a 404 error. I know this might seem inconvenient, but it removes a huge assumption problem from the system that would only become bigger over time.

If this leads to problems regarding unit tests, we need to find a way to select an appropriate set of datasets/flows/runs to sync on the test server and work with those.
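
The precedence described here, restated as a sketch (the real backend is PHP; this just spells out the logic, with hypothetical parameter names):

```python
# Sketch of the new precedence: a stored file_id wins and is resolved against
# the server answering the request; the url field only covers external data.
from typing import Optional

def dataset_url(request_base: str, file_id: Optional[int], external_url: Optional[str]) -> str:
    if file_id is not None:
        # generated per request, so it always points at the serving host;
        # downloading then simply returns 404 if the file is absent there
        return f"{request_base}/data/download/{file_id}"
    if external_url is not None:
        return external_url               # dataset hosted outside OpenML
    raise ValueError("dataset has neither a file_id nor an external url")
```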

joaquinvanschoren commented 7 years ago

Do I only get a 404 if a file is missing on the test server while it is expected to be there, or also for datasets stored elsewhere?


janvanrijn commented 7 years ago

Datasets stored elsewhere are handled as such. However, if I am correct, these are currently exactly 5 datasets. We might need to re-engineer this feature before more come in.

janvanrijn commented 7 years ago

My suggestion would be: truncate the database and the file system on every first of the month. Erase everything completely. Then import some key datasets (for example the UCI set), key flows (default weka, default mlr, default sklearn) and key tasks. Also fill the necessary tables (e.g., math_function, data_quality, etc.).

Now, when unit testing something related to a run, a run needs to be submitted first. Most of my unit tests that perform a run_get operation also perform a run_upload.

This is also how I developed the system before we went 'live'
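
The upload-then-get pattern described above, as a test sketch (the client object and the upload_run/get_run helpers are hypothetical placeholders, not the real API of any of the client packages):

```python
def test_run_roundtrip(client, run_description, predictions_file):
    # Upload first, then read back, so the test also passes on a freshly
    # truncated test server that contains no runs yet.
    # `client`, `upload_run` and `get_run` are hypothetical placeholders.
    run_id = client.upload_run(run_description, predictions_file)
    run = client.get_run(run_id)
    assert run.id == run_id
    assert run.task_id == run_description.task_id
```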

giuseppec commented 7 years ago

I guess most of the unit tests seem to pass now. There is still the issue with the url that links to missing files, e.g. https://test.openml.org/data/download/1848702/predictions33fc742e7d18.arff when trying to get run 542235 from https://test.openml.org/api/v1/run/542235. If we are 100% sure that something like this cannot happen on the main server, I could rewrite this unit test. What is the best way of resolving this 'issue'?

janvanrijn commented 7 years ago

I am happy to hear that.

(The evaluation engine is currently at run id ~ 370k. We currently have 8 parallel processes running, but cannot initiate more, as the server is running against the RAM limits)

I just 'hand ran' this particular run; it gives a 404 error (the file does not exist on the test server). It does exist on the live server, though. I think we can be fairly sure that this will not happen on live.

My analysis of what happened:

I suggest that you adapt the test case to use a different run number. All the runs that are flagged as 'without error' on the test server are now guaranteed to be without errors. Furthermore, I would advise picking one with a low run id. These kinds of tests are not performed often, but when they are, apparently the evaluation engine takes a long time to catch up.

Apart from this, do we have a green light to merge the development branch onto the live server?


janvanrijn commented 7 years ago

The problem is getting worse:

[janvanrijn@capa public_html]$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2       118G  111G  147M 100% /
tmpfs            95G     0   95G   0% /dev/shm
/dev/sda1       976M  159M  766M  18% /boot
/dev/sdb1       5,3T  5,0T   37G 100% /home

The test server is full. We should delete something.

giuseppec commented 7 years ago

Hm... this will always happen if you want to keep the test server in sync with the main server, right? Your suggestion was to have a test server that will be synced once a month or so but apparently the hardware makes this unrealistic, right? To be honest I don't care about a test server that is in sync with the main server. I just need a test server to upload meaningless testing stuff. For me, it would be sufficient to have a test server containing, say, only the openml100 datasets/tasks. I can then upload the runs and flows I want to test. Wouldn't a clean test server containing only the openml100 datasets/tasks solve this issue of "full test server"?

janvanrijn commented 7 years ago

I completely agree.


janvanrijn commented 7 years ago

Today I worked on this a bit. Will be finished tomorrow.

In the meantime, many test cases will probably break. (No tasks, no flows, and no runs are available on the server yet. We do have the OpenML 100, though.)

giuseppec commented 7 years ago

How many hardware resources are now available after this "cleanup"?

giuseppec commented 7 years ago

The oml:data_set_description/oml:md5_checksum field seems to be missing for data on the test server -> this makes trouble.

janvanrijn commented 7 years ago

I just finished setting up the new test server.

Although it is not a cron job (yet), with one push of a button we can completely reset the whole test server. The script is located in my home directory on capa; @joaquinvanschoren and I can both execute ~/openmltest/reset.sh. (This script is NOT available in the git repository, and should never be uploaded to the live server.)

Whenever this script gets executed, it completely deletes all user data, the expdb database, and the search indices. Then it reinstalls the database with some basic column fillings. In this case

One of the nice properties is that it (initially) does not rely on any files. As OpenML can work with 'external' data sources, all datasets are stored externally on the live server. Hence, the url field of the datasets will point to www.openml.org, but in this case that is intended behaviour and it is guaranteed that the mapping is correct.

Now the interesting part can start. I ran all Weka and Python Unit tests (had to adapt some task ids, but all seems fine). However, it is a very good test to see how various parts of the website react to the absence of data. This is something we could not test before (for example, the server always contained runs, and always contained tasks for all datasets)

Some parts of the website seem to not handle this so well. I think it is important to fix these accordingly. Let's try and adapt our code so it works in all cases, even without making the assumption that (some) data is present.

janvanrijn commented 7 years ago

> The oml:data_set_description/oml:md5_checksum field seems to be missing for data on the test server -> this makes trouble.

That's kind of annoying. The situation is the following: all datasets do have an md5_checksum, except the datasets that are stored at a different location. On the live server there are only 5 such datasets (e.g., 5304, 5305, 23411, ...), and that leaves us with a (suppressed) PHP error. Now, on the test server, all datasets are stored on a different server (i.e., the live server). In order to prevent the (suppressed) error, I now only show that field for datasets for which we have that information.

I could add a dummy field, but that would be a horrible hack (i.e., presenting something as information while it is absent). Is there any other way we can solve this?

giuseppec commented 7 years ago

Hm, is it then a bug in the R package if we throw an error when the XML does not contain an md5_checksum field? I was not aware of this "special case" where data is hosted at a different location. We are just relying on what the XSD schema tells us (https://github.com/openml/website/blob/master/openml_OS/views/pages/api_new/v1/xsd/openml.data.upload.xsd), and as far as I can see the XSD tells me that the XML should contain an md5_checksum field (if not, we throw an error).

janvanrijn commented 7 years ago

You are right.

I put the field back. It contains "NotApplicable" in this case. We can discuss later how we handle this long term.
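
On the client side, a tolerant way to handle this could look like the following sketch (only verify the checksum when the description carries a real one; a missing field and the "NotApplicable" placeholder are treated the same):

```python
# Sketch of tolerant checksum handling for externally hosted datasets.
import hashlib
from typing import Optional

def verify_dataset(content: bytes, md5_from_description: Optional[str]) -> None:
    # Externally hosted datasets may report no checksum, or "NotApplicable".
    if not md5_from_description or md5_from_description == "NotApplicable":
        return
    actual = hashlib.md5(content).hexdigest()
    if actual != md5_from_description:
        raise ValueError(f"md5 mismatch: expected {md5_from_description}, got {actual}")
```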

giuseppec commented 7 years ago

https://test.openml.org/search?type=run is empty and evaluations of runs do not exist (maybe due to https://github.com/openml/OpenML/issues/400 or is the evaluation engine of the test server independent from the one of the main server?).

janvanrijn commented 7 years ago

There are no runs available in the test server image (see https://github.com/openml/OpenML/issues/386#issuecomment-289148422). I could add those, but that's going to be a pain. Both the Java and Python APIs are currently using the live server for run listings.

Of course, you can upload runs to the test server. Furthermore, there is an evaluation engine running, making sure that uploaded runs will usually be evaluated within a minute.

giuseppec commented 7 years ago

Ok, https://test.openml.org/r/66 (this is one of my runs) and https://test.openml.org/r/50 (this is one of your runs, I guess) both still don't seem to contain evaluations (i.e., it seems to take much more than a minute)?

janvanrijn commented 7 years ago

You are right. I checked it, and apparently the evaluation engine is also incapable of evaluating runs that have an external dataset. I will push a fix.

janvanrijn commented 7 years ago

Should be fixed.

janvanrijn commented 7 years ago

For clarity, see the API views. There are still some issues with Elasticsearch; I assume @joaquinvanschoren will fix these at some point.