kahst / BirdCLEF2017

Source code of the TUCMI submission to BirdCLEF2017
MIT License

birdCLEF_sort_data.py creates ? in folder names #4

Closed dreamflasher closed 6 years ago

dreamflasher commented 6 years ago

After executing birdCLEF_sort_data.py, the train and val subfolders for birds contain question marks (?), which do not exist in the labels file.

igo312 commented 6 years ago

Hello, I'm going to do this work too. If you use Windows, then yes, it cannot work; Windows does not allow '?' in file or folder names. In my case, I replaced '?' with '!' and it worked.

kahst commented 6 years ago

@DreamFlasher did that resolve the problem? That kind of error is new to me, but I never gave it a run on Windows.

dreamflasher commented 6 years ago

I'm not on Windows. Ubuntu 16.04

kahst commented 6 years ago

I tried to put some info on the file into the file path: genus, species, class ID, recording quality and the source filename itself. Where exactly do the question marks occur? Maybe they originate from some of the XML data. If that's the case, you could try to change the naming scheme. Do the question marks also occur in filepaths.txt?

dreamflasher commented 6 years ago

Okay, I am trying to debug this again. I got the data from here: https://www.crowdai.org/clef_tasks/5/task_dataset_files?challenge_id=24 -- They say it's the 2017 data. I have 24506 XML files in there; does that match the dataset you have?

When I run the birdCLEF_sort_data.py script, I get question marks in the folder names. E.g., a full path could look like: data/train/src/Piaya cayana mehleri ? ioagnm/3_LIFECLEF2015_BIRDAMAZON_XC_WAV_RN15403.wav, and there are about 40 such cases. And yes, this is also in filepaths.txt. Checking the corresponding XML file confirms it: the ? is already in the XML file (sub-species=mehleri ?). Maybe it's best to go with the class IDs only?
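A minimal sketch of how such a check could be automated, assuming all XML metadata files sit in one directory. The function name `find_question_marks` and the flat directory layout are assumptions for illustration and are not part of the repository:

```python
import os
import xml.etree.ElementTree as ET


def find_question_marks(xml_dir):
    """Return (filename, tag, text) for every XML element whose text contains '?'.

    Hypothetical helper: assumes one flat directory of .xml metadata files.
    """
    hits = []
    for name in sorted(os.listdir(xml_dir)):
        if not name.lower().endswith(".xml"):
            continue
        try:
            root = ET.parse(os.path.join(xml_dir, name)).getroot()
        except ET.ParseError:
            continue  # skip files that are not well-formed XML
        # root.iter() visits the root element and all of its descendants
        for elem in root.iter():
            text = (elem.text or "").strip()
            if "?" in text:
                hits.append((name, elem.tag, text))
    return hits
```

Calling `find_question_marks("path/to/xml")` and printing the result should list which fields carry the question mark, without relying on any particular tag name.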

I went with the class IDs and got 999 different ones. https://github.com/kahst/BirdCLEF2017/blob/master/birdCLEF_class_labels.py has 1500, so this doesn't match. Maybe the dataset is different after all?

kahst commented 6 years ago

The 2017 data consists of two archives: one contains 999 species and the other 501. If you download both (they are both listed in the dataset section of the challenge), you should be able to re-create the dataset from 2017.

If the question marks occur in the XML files, you could try not to use the sub-species as part of the folder name. If you use the class ID only, you should be fine; the rest of the name is just for better readability. You can change the sort data script to include only the class ID.
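As a rough sketch of that change, the folder name could be derived from the class ID alone. The tag name `ClassId` and the helper `class_id_folder` are assumptions here, not taken from birdCLEF_sort_data.py; adjust the tag to whatever the XML files actually use:

```python
import re
import xml.etree.ElementTree as ET


def class_id_folder(xml_path, class_tag="ClassId"):
    """Build a folder name from the class ID only.

    `class_tag` is an assumed tag name; pass the real one if it differs.
    """
    root = ET.parse(xml_path).getroot()
    class_id = (root.findtext(".//" + class_tag) or "").strip()
    if not class_id:
        raise ValueError("no <%s> value in %s" % (class_tag, xml_path))
    # Replace anything outside letters, digits, '_' and '-' so the name
    # is safe on common file systems (this also removes '?').
    return re.sub(r"[^A-Za-z0-9_-]", "_", class_id)
```

The substitution is only a safety net: if the class IDs are plain alphanumeric strings like `ioagnm` in the path above, they pass through unchanged.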

dreamflasher commented 6 years ago

Thank you. Yeah, I figured out today too that there is a second part of the dataset. It looks like the archive is missing 4 wav files, but otherwise it looks a lot better now. I also changed the script to use only the class IDs, but I only get 1330 unique IDs after the sorting, so something is still off.

kahst commented 6 years ago

You're right, the archive is missing four files, but that shouldn't be a problem. I'm not sure why you only get 1330 species; using the class IDs should result in exactly 1500 species. I tried that with the current download and it still works the same way as it did last year. Did you find a reason for this? Maybe some kind of duplicates?
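One way to narrow down such a mismatch is to count class IDs directly from the XML files of both archives and compare them against the expected label list. This is only a sketch: the tag name `ClassId`, the helper names, and the idea that the labels can be supplied as a plain list of class IDs are all assumptions:

```python
import os
from collections import Counter
import xml.etree.ElementTree as ET


def count_class_ids(xml_dirs, class_tag="ClassId"):
    """Count recordings per class ID across several XML directories.

    `class_tag` is an assumed tag name. Files without that tag are
    counted under the empty string so they remain visible.
    """
    counts = Counter()
    for xml_dir in xml_dirs:
        for name in os.listdir(xml_dir):
            if not name.lower().endswith(".xml"):
                continue
            root = ET.parse(os.path.join(xml_dir, name)).getroot()
            class_id = (root.findtext(".//" + class_tag) or "").strip()
            counts[class_id] += 1
    return counts


def missing_labels(counts, expected_labels):
    """Return expected class IDs that never occur in the counted XML files."""
    return sorted(set(expected_labels) - set(counts))
```

If `len(count_class_ids([dir_a, dir_b]))` already gives 1500 while the sorted folders give fewer, the loss happens during sorting rather than in the download; `missing_labels` then shows which IDs to look for.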