Open gsganden opened 5 years ago
Would it be possible to train a separate model for each part of the year? So a model for summer, a model for fall, etc. With these models, a final probability could be obtained by either (1) summing the probabilities from each model or (2) detecting the season in which the photo was taken and using that season's model.
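A rough sketch of the two options, just to make the idea concrete. Everything here is an assumption for illustration: the season boundaries are the meteorological Northern Hemisphere ones, the species labels are made up, and option (1) is shown as an average (the sum divided by the number of models) so the result still looks like a probability.

```python
from datetime import date
from typing import Dict, List

SEASONS = ("winter", "spring", "summer", "fall")


def season_of(day: date) -> str:
    """Option (2): map a capture date to a season (Dec-Feb counts as winter)."""
    return SEASONS[(day.month % 12) // 3]


def average_probabilities(per_model: List[Dict[str, float]]) -> Dict[str, float]:
    """Option (1): combine per-season model outputs by averaging each label."""
    labels = per_model[0].keys()
    return {
        label: sum(probs[label] for probs in per_model) / len(per_model)
        for label in labels
    }


# Hypothetical outputs from a "summer" model and a "fall" model for one image.
summer_probs = {"coyote": 0.8, "empty": 0.2}
fall_probs = {"coyote": 0.4, "empty": 0.6}

combined = average_probabilities([summer_probs, fall_probs])
chosen_season = season_of(date(2016, 7, 15))  # picks which single model to use
```

Option (2) needs a reliable capture date for every photo, while option (1) needs every model to be run on every photo.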
It would be possible, but I doubt that it would be better. Do you think that it would be?
I think it would be possible to get better results. Detecting objects (or in this case animals) is easier and more efficient when the environments are similar. I think that would apply to this algorithm as well.
Try it!
Hi all! Came across this project and it looks both interesting and in my wheelhouse. Interested in contributing if you are open to it and it is still an active project.
@DigitalPhilosopher are you working on this issue or is it up for grabs?
Great! It is absolutely an active project in theory :-). Addressing this issue in particular would be a great help in moving it forward.
@gsganden Good to know! I've made progress cleaning the 2012-14 labels in the CSV to align with 16-17, but I may need some extra information to restructure the image filenames so they can run through the build_dataset process, particularly to recreate a process_raw.py for the 2012-14 images.
Is there any additional documentation regarding filename structure?
I am not aware of any. @mfidino can you provide additional information?
The filename structure should be something like:
./images_2016/DPT/D03-AMP1/ _CHIL - D03-AMP1-JU16_00037.JPG
This represents:
./{year sampled}/{transect sampled}/{site sampled}/{file name}
Transect is the section of the city we are sampling (DPT is northwest, RST is west, SCT is southwest, JNT is the heart of Chicago). What I think is likely the most important thing is the site-sampled part of the file name, which likely lines up with the 2012-2014 data.
You could also just parse the {file name} part to get the site info if you wanted.
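In case it helps, here is a minimal sketch of splitting a path into the four components described above. The names `ImageLocation` and `parse_image_path` are made up for this example, and the file name in the usage line is a simplified, hypothetical one rather than a real file from the dataset.

```python
from pathlib import PurePosixPath
from typing import NamedTuple

# Transect codes and the areas they cover, as described above.
TRANSECT_AREAS = {
    "DPT": "northwest",
    "RST": "west",
    "SCT": "southwest",
    "JNT": "heart of Chicago",
}


class ImageLocation(NamedTuple):
    year_dir: str
    transect: str
    site: str
    file_name: str


def parse_image_path(path: str) -> ImageLocation:
    """Split {year sampled}/{transect sampled}/{site sampled}/{file name}."""
    parts = PurePosixPath(path).parts
    if len(parts) < 4:
        raise ValueError(f"expected at least 4 path components, got {path!r}")
    # Take the last four components so a leading "./" or data directory is ignored.
    return ImageLocation(*parts[-4:])


# Simplified, hypothetical file name used only for illustration.
loc = parse_image_path("./images_2016/DPT/D03-AMP1/D03-AMP1-JU16_00037.JPG")
area = TRANSECT_AREAS[loc.transect]
```

Taking the last four components keeps the function indifferent to whatever directory the data happens to live under.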
I'm seeing two issues with the 2012-2014 data:
- The CSV says that the file paths start like this, with the season/year repeated in the third component: SP12/DPT/D02-HUP1-SP12/. However, the paths in the actual "FA" directories don't repeat the season in the third component, e.g. FA12/DPT/D02-HUP1/. @datmar is that what you are seeing?
- labels.csv only covers 2012. @mfidino could we get labels for the 2013 and 2014 images?
It's going to take me a bit to look into the labels for the 2013 and 2014 images for a couple reasons.
Way back when we tagged those images, we used to write the species tag into the photo metadata. I wrote a Ruby script a long time ago to pull those tags (if I recall, all the keywords are under subject in the EXIF data).
The Ruby script is here, which may at least point to where you can look to get the species tags: https://github.com/mfidino/photo_pull/blob/master/parse_photo_exif.rb
Thank you, and congrats on the new baby!
@mfidino thank you, that's good to know!
@gsganden Sorry haven't had much time to work on this, schedule's been a little upended due to covid-19 (I'm in Manhattan). Will squeeze some time into getting this done in the next few days.
No worries, this is all voluntary so anything you can do is a bonus. I have time to work on this project today and this ticket is by far the top priority, so I'll be working on it as well.
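import os
from pathlib import Path


def get_path(filename_build):
    # Paths in the CSV repeat the season in the site directory, but the "FA"
    # directories on disk do not, so strip the season suffix for those.
    if filename_build.startswith("FA"):
        filename_build = Path(filename_build)
        return (
            Path(os.getenv("AUTOFOCUS_DATA_DIR"))
            / "lpz_2012-2014"
            / "raw"
            / Path(*filename_build.parts[:2])
            / "-".join(filename_build.parts[2].split("-")[:2])
            / Path(*filename_build.parts[3:])
        )
    else:
        return (
            Path(os.getenv("AUTOFOCUS_DATA_DIR"))
            / "lpz_2012-2014"
            / "raw"
            / filename_build
        )


# progress_apply is only available after tqdm.pandas() has been called.
df.loc[:, "path"] = df.loc[:, "filename_build"].progress_apply(get_path)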
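To sanity-check the remapping without touching the disk or AUTOFOCUS_DATA_DIR, the same string logic can be pulled out into a small pure function. This is only a sketch: `normalize_csv_path` is a made-up name, and the file names in the usage lines are hypothetical.

```python
from pathlib import PurePosixPath


def normalize_csv_path(filename_build: str) -> PurePosixPath:
    """Drop the repeated season from the site directory of "FA" paths."""
    parts = PurePosixPath(filename_build).parts
    if not filename_build.startswith("FA") or len(parts) < 3:
        # Other seasons (and unexpectedly short paths) are left untouched.
        return PurePosixPath(filename_build)
    # "D02-HUP1-FA12" -> "D02-HUP1": keep only the first two dash-separated fields.
    site = "-".join(parts[2].split("-")[:2])
    return PurePosixPath(*parts[:2], site, *parts[3:])


# Hypothetical file names, used only to show the mapping.
fall = normalize_csv_path("FA12/DPT/D02-HUP1-FA12/example.JPG")
spring = normalize_csv_path("SP12/DPT/D02-HUP1-SP12/example.JPG")
```

Here `fall` comes out as FA12/DPT/D02-HUP1/example.JPG, while `spring` keeps its season suffix, matching the behavior of the two branches above.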
>> from PIL import Image
>>
>>
>> def get_exif(filename):
>>     image = Image.open(filename)
>>     image.verify()
>>     return image._getexif()
>>
>>
>> df.loc[:, "path"].progress_apply(get_exif).notna().mean()
0.0
>> import piexif
>>
>> piexif.load(str(df1.loc[0, "path"]))
{'0th': {}, 'Exif': {}, 'GPS': {}, 'Interop': {}, '1st': {}, 'thumbnail': None}
>> identify -verbose "data/lpz_2012-2014/raw/FA14/JNT/J01-LMP1/J01-LMP1-FA14 (11).JPG"
Image: data/lpz_2012-2014/raw/FA14/JNT/J01-LMP1/J01-LMP1-FA14 (11).JPG
  Format: JPEG (Joint Photographic Experts Group JFIF format)
  Class: DirectClass
  Geometry: 227x227+0+0
  Resolution: 72x72
  Print size: 3.15278x3.15278
  Units: Undefined
  Type: TrueColor
  Endianess: Undefined
  Colorspace: sRGB
  Depth: 8-bit
  Channel depth:
    red: 8-bit
    green: 8-bit
    blue: 8-bit
  Channel statistics:
    Red:
      min: 0 (0)
      max: 255 (1)
      mean: 125.861 (0.493571)
      standard deviation: 47.4512 (0.186083)
      kurtosis: 0.0259633
      skewness: 0.116636
    Green:
      min: 0 (0)
      max: 255 (1)
      mean: 129.525 (0.507942)
      standard deviation: 47.0568 (0.184536)
      kurtosis: 0.046401
      skewness: -0.0850175
    Blue:
      min: 0 (0)
      max: 224 (0.878431)
      mean: 67.0616 (0.262987)
      standard deviation: 43.2846 (0.169743)
      kurtosis: -0.0088973
      skewness: 0.538302
  Image statistics:
    Overall:
      min: 0 (0)
      max: 255 (1)
      mean: 107.482 (0.4215)
      standard deviation: 45.9693 (0.180272)
      kurtosis: 1.94601
      skewness: 0.14245
  Rendering intent: Perceptual
  Gamma: 0.454545
  Chromaticity:
    red primary: (0.64,0.33)
    green primary: (0.3,0.6)
    blue primary: (0.15,0.06)
    white point: (0.3127,0.329)
  Interlace: None
  Background color: white
  Border color: srgb(223,223,223)
  Matte color: grey74
  Transparent color: black
  Compose: Over
  Page geometry: 227x227+0+0
  Dispose: Undefined
  Iterations: 0
  Compression: JPEG
  Quality: 75
  Orientation: Undefined
  Properties:
    date:create: 2020-03-17T15:09:12+00:00
    date:modify: 2016-09-02T07:01:19+00:00
    jpeg:colorspace: 2
    jpeg:sampling-factor: 2x2,1x1,1x1
    signature: b74a2e95ab2c5990d8da57c24705af34221be10b7db1d93d0153241429216bc6
  Artifacts:
    filename: data/lpz_2012-2014/raw/FA14/JNT/J01-LMP1/J01-LMP1-FA14 (11).JPG
    verbose: true
  Tainted: False
  Filesize: 24.6KB
  Number pixels: 51.5K
  Pixels per second: 0B
  User time: 0.000u
  Elapsed time: 0:01.000
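All three checks above come back empty, which suggests these 227x227 copies simply carry no EXIF block. One more way to confirm that without relying on Pillow or piexif is to walk the JPEG segment headers and look for an APP1 segment that starts with the "Exif" identifier. This is a rough stdlib-only sketch; `has_exif_segment` is a made-up helper name, and it only inspects the headers that precede the image data.

```python
import struct


def has_exif_segment(data: bytes) -> bool:
    """Return True if the JPEG bytes contain an APP1 segment tagged 'Exif'."""
    if data[:2] != b"\xff\xd8":  # every JPEG starts with the SOI marker
        raise ValueError("not a JPEG stream")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            return False  # lost sync with the marker structure
        marker = data[pos + 1]
        if marker == 0xFF:  # fill byte before a marker, skip it
            pos += 1
            continue
        if marker in (0xDA, 0xD9):  # start of scan / end of image: headers are over
            return False
        # The 2-byte big-endian length includes the length field itself.
        (length,) = struct.unpack(">H", data[pos + 2 : pos + 4])
        payload = data[pos + 4 : pos + 2 + length]
        if marker == 0xE1 and payload.startswith(b"Exif\x00\x00"):
            return True
        pos += 2 + length
    return False


# Tiny synthetic header-only streams to exercise the function (not real images).
_jfif = b"\xff\xe0" + struct.pack(">H", 7) + b"JFIF\x00"
_exif = b"\xff\xe1" + struct.pack(">H", 12) + b"Exif\x00\x00MM\x00*"
_sos = b"\xff\xda\x00\x02"
with_exif = has_exif_segment(b"\xff\xd8" + _jfif + _exif + _sos)
without_exif = has_exif_segment(b"\xff\xd8" + _jfif + _sos)
```

On real files it could be called as `has_exif_segment(Path(p).read_bytes())`. If it returns False for these copies, the species tags would have to come from the original, unresized photos instead.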
Using only the 2016-2017 data is very limiting because it is only from mid-summer. I wouldn't expect models trained on just this data to generalize to other times of year, and indeed we have seen substantial performance drops on images from other times of year. We have data from all seasons from 2012-2014. It is formatted differently but contains roughly the same information. Putting these datasets together and training on the result is the lowest-hanging fruit for providing more value with this project.