Closed gerwolf closed 1 year ago
[...] `sector-id`) and collapse them into a single dataset which can be joined through `idnum`?
@gerwolf
@VFMR FYI - I generated those two files; they are on `admin-prod-01` and are called `IBS_training_dataset.csv` and `IBS_test_dataset.csv`. Please make sure that only the former, `IBS_training_dataset.csv`, is made available on `source-date`!
Where did you put the test data, btw? I cannot find it.
We might need an overview of all the target names because I do not think it is very helpful to predict all 150 columns of the final month. There are things like the date and time of survey participation that need not be predicted. I would suggest to have only the ~ 15 main question item targets. I can make a file with the target names and the structure of the test data. For automatic data evaluation, the test data should hence only include the idnum and this selection of features.
Also, the sample is unbalanced. Thus, participants need to know for which ids they need to predict!
I suggest providing a `sample_submission.csv` that contains the feature names and all the `idnum`s but leaves the other variables empty. I have prepared such a file and a new test data file to go along with it. The `IBS_sample_submission.csv` will be provided in the source data buckets (it is currently only in the one for `work_prod_01`) and the `IBS_test_data.csv` is in `~/Desktop/project_directory/input/`.
Another important thing to note about validation: the survey items are not mandatory! They may be missing in some cases.
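Since items can be missing, one way to evaluate is to score only the rows where a true answer exists. The following is a minimal sketch, not the agreed evaluation: the metric (mean absolute error) and the helper names `is_missing` and `masked_mae` are assumptions, as the thread does not fix a metric.

```python
import math


def is_missing(value):
    """Treat None, empty strings and NaN as a missing survey answer."""
    if value is None or value == "":
        return True
    return isinstance(value, float) and math.isnan(value)


def masked_mae(y_true, y_pred):
    """Mean absolute error over rows where the true answer exists.

    Predictions are assumed to be complete (no missing values).
    Returns None when no true answer is available at all.
    """
    pairs = [
        (float(t), float(p))
        for t, p in zip(y_true, y_pred)
        if not is_missing(t)
    ]
    if not pairs:
        return None
    return sum(abs(t - p) for t, p in pairs) / len(pairs)
```

With this approach a participant is neither rewarded nor penalised for a cell the respondent left blank.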
> We might need an overview of all the target names because I do not think it is very helpful to predict all 150 columns of the final month. There are things like the date and time of survey participation that need not be predicted. I would suggest to have only the ~ 15 main question item targets.
The three targets of interest are: `vg_statebus`, `vg_comexp` and `vg_priceexp`. Those are the only columns we care to predict.
Okay. I created a new sample_submission file including only these targets.
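For reference, a sample submission of this shape could be generated roughly as follows. This is only a sketch using the Python standard library: the function name `build_sample_submission` is hypothetical, and the actual file in the thread may have been produced differently.

```python
import csv
import io

# The three targets named in this thread.
TARGETS = ["vg_statebus", "vg_comexp", "vg_priceexp"]


def build_sample_submission(idnums, targets=TARGETS):
    """Return CSV text with one row per idnum and empty target cells."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["idnum"] + list(targets))
    for idnum in sorted(set(idnums)):
        # Empty strings leave the prediction cells blank for participants.
        writer.writerow([idnum] + [""] * len(targets))
    return buffer.getvalue()


if __name__ == "__main__":
    print(build_sample_submission([3, 1, 3]))
```

Listing every `idnum` once tells participants exactly which ids they need to predict, which matters because the sample is unbalanced.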
Sorry for the confusion, here's how to go about it:
- [...] `vg_weight` and `sector_id`; I confirmed that they are almost time-invariant: dropping duplicate rows (without year-month) reduces the panel from 1,808,565 rows to 58,247, i.e. by 97%, which is a bit less than just counting unique `idnum`, which make up about 0.01% of the entire panel size, meaning there is some time-variation in these features but not much. The idea is that this information will be provided to the participants after they have done some feature engineering, so that they can confirm any clusters they have identified and interpret them qualitatively.
- In `source-data/work-vm-prod01/IBS Rollout Master` you find the following files: `IBS_paneldata_train.csv` (the main dataset for model development), `IBS_basedata_train.csv` (provided separately at a later stage), `IBS_paneldata_test.csv` (the main dataset for evaluating predictions; we use only the three targets `vg_statebus`, `vg_comexp` and `vg_priceexp`, as correctly pointed out by @muskuloes above) and lastly `IBS_basedata_test.csv` (which isn't really required).
- [...] `year`, `month` and `idnum`.

@VFMR please roll out the file `IBS_paneldata_train.csv` in directory `source-data/work-vm-prod01/IBS Rollout Master` for now only. Thanks!
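The time-invariance check described above (drop the year-month, then drop duplicate rows and compare counts) can be sketched like this. It is an illustration in plain Python on in-memory rows; the function name `time_invariance_report` and its return keys are hypothetical, and the original check was presumably run on the full panel with other tooling.

```python
def time_invariance_report(rows, id_key, feature_keys):
    """Compare row counts before and after dropping repeated feature values.

    rows: iterable of dicts, one per (idnum, year, month) observation.
    The year and month are ignored, so an entity whose features never
    change contributes exactly one distinct row.
    """
    rows = list(rows)
    if not rows:
        raise ValueError("rows must not be empty")
    distinct = {(r[id_key], *(r[k] for k in feature_keys)) for r in rows}
    ids = {r[id_key] for r in rows}
    return {
        "n_rows": len(rows),
        "n_distinct": len(distinct),  # equals n_ids if fully time-invariant
        "n_ids": len(ids),
        "reduction": 1 - len(distinct) / len(rows),
    }
```

If `n_distinct` is only slightly larger than `n_ids`, the features vary a little over time but not much. For the figures quoted above, the reduction is 1 - 58,247 / 1,808,565, or roughly 97%.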
The file `IBS/IBS_paneldata_train.csv` is now copied to all sourcedata folders. I checked VMs 2, 16, and 34. All could see the file as planned.
If there are no further requirements in this regard, the issue can be closed.
Great, @VFMR thanks a lot. Where did you move the files `IBS_basedata_train.csv`, `IBS_paneldata_test.csv` and `IBS_basedata_test.csv`?
@gerwolf These files are in `project-directory/input/...` of the admin vm
Test files are now in `shared-data/IBS`.
@VFMR Please roll out the final dataset for the ifo challenge to all VMs as discussed on Friday.
The test period will be December 2022; please drop that month for all entities and leave one original, complete file as it is now on the `admin-vm` for validation. I'll set up a Gitea repo for @muskuloes and me to finalise the leaderboard (Streamlit should be working by now).
Please drop any columns such as industry classification, firm size etc. which are (almost) time-invariant and provide this information in a separate `.csv` file which can be merged to the main panel.
We should aim to have everything finalised by the afternoon of Wednesday 26 April 2023.
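The rollout request above amounts to two steps: hold out December 2022 as the test period, and move the (almost) time-invariant columns into a separate file keyed by `idnum`. A rough sketch under stated assumptions: the function name `split_for_rollout` is hypothetical, the default `base_keys` simply reuse the `sector_id` and `vg_weight` columns mentioned earlier in the thread, and the base table is built from all months.

```python
def split_for_rollout(rows, test_year=2022, test_month=12,
                      base_keys=("sector_id", "vg_weight")):
    """Split panel rows into a train panel, a test panel and base data.

    rows: iterable of dicts with at least idnum, year, month and base_keys.
    Returns (train_rows, test_rows, base_rows); the input is not modified,
    so the original complete data stays available for validation.
    """
    train, test = [], []
    base = {}  # dict used as an insertion-ordered set of distinct combos
    for r in rows:
        base[(r["idnum"], *(r[k] for k in base_keys))] = None
        # Panel rows keep everything except the time-invariant columns.
        panel_row = {k: v for k, v in r.items() if k not in base_keys}
        is_test = (int(r["year"]), int(r["month"])) == (test_year, test_month)
        (test if is_test else train).append(panel_row)
    header = ("idnum", *base_keys)
    base_rows = [dict(zip(header, values)) for values in base]
    return train, test, base_rows
```

Because the base rows keep `idnum`, they can later be merged back onto the main panel. An entity whose base features change over time appears more than once in `base_rows`, which makes the remaining time-variation visible rather than hiding it.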