Automatically exported from code.google.com/p/unm-macroecology-2012

Data Check #32

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Hello!

So for those of you who missed the data team meeting today, here's a quick 
summary: We decided to find info on public vs. private, city demographics, and 
also assign columns for the data check.

So here are the assignments:
Dan: B, C, D, E
Christian: F, G, H, I
Mary: J, K, L, M
Tracy: N, O, P, Q
Xiaoben: R, S, T, V, Y
Jason: Z, AC, AE, AF, AG
Kevin: AH, AJ, AK, AN, AS
Libby: AT, AU, AZ, BA, BB
Adeline: BC, BD, BE, BI, BJ

The file has already been uploaded in the Downloads link. Basically, we're just 
looking for obviously erroneous entries (such as the example of 57,000,000 vs. 
57). If you find an error, please make note of the change (and location of 
cell) in the Meta_Data sheet (sheet 2 of the same excel file). Also, if you 
have any doubts whatsoever about whether or not something is a mistake, please 
also include that in the Metadata.
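A minimal sketch of how the "57,000,000 vs. 57" kind of slip could be flagged automatically, assuming a column has already been read into a list of numbers with blanks as None. The function name and the three-orders-of-magnitude threshold are illustrative assumptions, not something the team agreed on, and flagged cells would still need a human look.

```python
import math
import statistics

def flag_magnitude_outliers(values, max_orders=3):
    """Return indices of values whose size differs from the column median
    by more than `max_orders` powers of ten (hypothetical threshold)."""
    # Blanks (None) and non-positive values are skipped when estimating
    # the typical size of an entry in this column.
    positive = [v for v in values if v is not None and v > 0]
    if not positive:
        return []
    median = statistics.median(positive)
    flagged = []
    for i, v in enumerate(values):
        if v is None or v <= 0:
            continue
        if abs(math.log10(v) - math.log10(median)) > max_orders:
            flagged.append(i)
    return flagged

# Made-up example column with one entry typed in the wrong units.
column = [57, 61, 49, 57_000_000, None, 52]
print(flag_magnitude_outliers(column))
```

Anything this returns is only a candidate for the Meta_Data sheet; a very large school could legitimately sit far from the median.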

Due date? I'd say Tuesday.

-Adeline

Original issue reported on code.google.com by adelinem...@gmail.com on 17 Feb 2012 at 2:03

GoogleCodeExporter commented 9 years ago
Does anyone know how to judge whether the data have errors or not? If two 
universities have the same numbers for any item, I would think it is an error. 

Anything else?

Original comment by sdpa...@gmail.com on 17 Feb 2012 at 8:54

GoogleCodeExporter commented 9 years ago
I judged errors on the basis of the surrounding information.  There were a few 
numbers that seemed very low but again I am unsure whether it is a problem or a 
real data point.  Any recommendations?
EM

Original comment by libbymon...@gmail.com on 17 Feb 2012 at 9:02

Attachments:

GoogleCodeExporter commented 9 years ago
I'd say make a note of anything that looks suspicious in the metadata.
Just pointing out what *might* be wrong should help...
If you think you know what it *should* be, then change it.

Original comment by icos.atr...@gmail.com on 18 Feb 2012 at 4:55

GoogleCodeExporter commented 9 years ago
I'm unsure how public vs. private, city demographics, etc. will be assigned.  
Same as zip codes?
Thanks!

Original comment by tracy.a....@gmail.com on 18 Feb 2012 at 5:02

GoogleCodeExporter commented 9 years ago
@Tracy -- it looks like Mary can get those out of a public database, so we 
don't need to enter it or error-check it (see Mary's "NCES" issue).  This issue 
is just for error-checking columns. 

Original comment by icos.atr...@gmail.com on 18 Feb 2012 at 5:08

GoogleCodeExporter commented 9 years ago
Does anyone have any standards for checking these errors? I worry that I might 
change data that are actually good.

Also, should I fix the errors in the original data sheet and highlight them, or 
should I enter the new data in the metadata sheet?

Original comment by sdpa...@gmail.com on 18 Feb 2012 at 4:25

GoogleCodeExporter commented 9 years ago
If you think you know what it *should* be, change it in the original data 
sheet, and note the change in the metadata sheet.

If it looks suspicious, then make a note in the metadata sheet.

Don't highlight anything.  These won't necessarily be processed in Excel.

Original comment by icos.atr...@gmail.com on 19 Feb 2012 at 3:21

GoogleCodeExporter commented 9 years ago
I've got some huge variation in my data, but I'm not sure if these are errors. 
This includes many gaps where nothing was reported. What should I do?

Original comment by SKMcCorm...@gmail.com on 20 Feb 2012 at 11:45

GoogleCodeExporter commented 9 years ago
I'm not sure where to make a change in the metadata, so I hope this works for 
everyone.
I didn't notice any obvious errors in N (full-fac), O (part-fac), P 
(full-staff), Q (part-staff).  I tried to check against current reports on 
CollegeStats.org and most seemed to be within a reasonable range of the current 
stats reported.  There were several blanks, which I left empty.  Some schools 
reported only full-time faculty and staff, or only full-time faculty.  
Just an observation I wanted to note; not sure if it is a real concern or not.

Original comment by tracy.a....@gmail.com on 21 Feb 2012 at 4:14

GoogleCodeExporter commented 9 years ago
I don't see any obvious errors in my data either (R, S, T, V, Y).

Xiaoben

Original comment by sdpa...@gmail.com on 21 Feb 2012 at 4:16

GoogleCodeExporter commented 9 years ago
I did find one potential error: AE164 seems to be incorrect. Other than that, 
everything looks OK.
~Jason

Original comment by jayco...@gmail.com on 21 Feb 2012 at 5:49

Attachments:

GoogleCodeExporter commented 9 years ago
I found quite a few errors in duplicate reporting, incorrect years, etc. So I 
modified the data file and listed all of my changes in the metafile. Hope the 
format I used in the metafile works; if not, let me know and I can change it.

Original comment by drcolm...@gmail.com on 21 Feb 2012 at 5:53

Attachments:

GoogleCodeExporter commented 9 years ago
@Kevin My understanding is that for a lot of the values you're looking at, some 
schools may not be using those particular types of energy source, or they may 
just not report them and/or don't have data for them. I think it'd be best to 
treat them as either not reported or no data available when no values are 
given. I'm working on writing up a summary of what goes into each piece of the 
data, how it's calculated, etc. so some of this is more clear to the analysis 
committee - should be done with it in the next couple days. 

Original comment by drcolm...@gmail.com on 21 Feb 2012 at 5:58

GoogleCodeExporter commented 9 years ago
Hey guys, these are the changes I made:

I added everything to the two files Dan sent.

-Adeline

Original comment by adelinem...@gmail.com on 21 Feb 2012 at 7:18

Attachments:

GoogleCodeExporter commented 9 years ago
Here are my corrections. I found a few errors, but for the majority of the data 
the only problem may be the zero scores, which for the larger colleges may 
indicate a non-reporting year.

Original comment by SKMcCorm...@gmail.com on 21 Feb 2012 at 7:52

Attachments:

GoogleCodeExporter commented 9 years ago
Sorry I haven't finished these yet.  Not forgotten.  Is someone planning on 
merging these together?

Original comment by icos.atr...@gmail.com on 23 Feb 2012 at 11:43

GoogleCodeExporter commented 9 years ago
My checked data is attached. I did not change anything. The problems are listed 
in the metadata next to the appropriate line number. The problems I saw were 
decimals in student enrollment, same enrollment listed for several years, and 
for a few institutions the breakdown of enrollment exceeded the total 
enrollment. I can check some of these things with our other data sources. If I 
change anything, I will submit an updated worksheet.
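For anyone who wants to repeat these three enrollment checks outside Excel, here is a rough sketch. The field names (`school`, `year`, `total`, `full_time`, `part_time`) are hypothetical stand-ins for the real column headers, and the function only reports problems; it does not change any data.

```python
def check_enrollment(rows):
    """rows: list of dicts with hypothetical keys
    'school', 'year', 'total', 'full_time', 'part_time'.
    Returns a list of (school, year, problem) tuples."""
    problems = []
    previous_total = {}  # last total seen for each school
    for row in sorted(rows, key=lambda r: (r["school"], r["year"])):
        school, year, total = row["school"], row["year"], row["total"]
        if total is None:
            continue  # blank cell: nothing to check
        # Check 1: enrollment should be a whole number of students.
        if total != int(total):
            problems.append((school, year, "non-integer enrollment"))
        # Check 2: the breakdown should not add up to more than the total.
        if row["full_time"] is not None and row["part_time"] is not None:
            if row["full_time"] + row["part_time"] > total:
                problems.append((school, year, "breakdown exceeds total"))
        # Check 3: an identical total in consecutive records is suspicious.
        if previous_total.get(school) == total:
            problems.append((school, year, "same total as previous year"))
        previous_total[school] = total
    return problems

# Made-up rows for illustration only.
rows = [
    {"school": "A", "year": 2008, "total": 1000, "full_time": 800, "part_time": 200},
    {"school": "A", "year": 2009, "total": 1000, "full_time": 900, "part_time": 200},
    {"school": "B", "year": 2008, "total": 512.5, "full_time": None, "part_time": None},
]
for problem in check_enrollment(rows):
    print(problem)
```

A repeated total is not necessarily wrong, so these are entries to note in the metadata rather than to overwrite.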

Mary 

Original comment by marymai...@gmail.com on 23 Feb 2012 at 3:28

Attachments:

GoogleCodeExporter commented 9 years ago
I can compile it!
-Adeline

Original comment by adelinem...@gmail.com on 23 Feb 2012 at 5:58

GoogleCodeExporter commented 9 years ago
So here's the merged data. I've got everyone's corrections in, except for 
Christian's. I suggest everyone take a look at the META data file. I've 
highlighted a few suspicious entries that should be fixed, or at least looked 
over. 

I changed all the location references so that they match the last file that Dan 
sent. But when that was too much of a pain, I indicated the school+year as 
reference to the main file. I also included relevant comments from this thread.

-Adeline

Original comment by adelinem...@gmail.com on 27 Feb 2012 at 4:54

Attachments:

GoogleCodeExporter commented 9 years ago
For my two highlighted comments I'm not sure what to do. I very much doubt that 
all zeros reported actually mean a zero measurement. For many of these I 
believe that they represent non-reported scores, and as such may create an 
irregular distribution. 

Original comment by SKMcCorm...@gmail.com on 27 Feb 2012 at 6:59

GoogleCodeExporter commented 9 years ago
Maybe we should recode zeros as N/A. I don't think it will distort the
whole picture, and statistical software will ignore it. Or we can let
the data group decide.
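To illustrate what recoding would do, here is a small sketch using made-up numbers. It assumes every zero is a non-report, which may not hold for every column, so a genuine zero measurement would be lost by this step.

```python
import statistics

def zeros_to_missing(values):
    """Replace 0 with None so it is treated as 'not reported'.
    Assumption: no zero in this column is a real measurement."""
    return [None if v == 0 else v for v in values]

def mean_ignoring_missing(values):
    """Mean of the reported values only; None if nothing was reported."""
    present = [v for v in values if v is not None]
    return statistics.mean(present) if present else None

# Made-up values for one school across five years.
reported = [120, 0, 95, 0, 110]
cleaned = zeros_to_missing(reported)

print(statistics.mean(reported))       # zeros counted as real measurements
print(mean_ignoring_missing(cleaned))  # zeros treated as missing
```

The gap between the two means shows how much unreported zeros could pull a summary statistic down if they are left in.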

Original comment by sdpa...@gmail.com on 27 Feb 2012 at 4:14

GoogleCodeExporter commented 9 years ago
Yeah I agree with Xiaoben. 

There are still 2 double entries (last two highlighted items). Do you want to 
fix those, Dan? Since you deleted all the other doubles, I feel it would be 
more consistent for you to judge which to keep. Also, you should probably leave 
the deleted entries as blank rows so that it doesn't shift all the reference 
locations in the meta data.
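A sketch of the blank-row idea, assuming duplicates are identified by school and year (hypothetical field names). It keeps the first occurrence purely for illustration; which copy to keep is still a judgment call.

```python
def blank_duplicates(rows, key_fields=("school", "year")):
    """Return a copy of `rows` in which repeated entries are replaced by
    None, so the number and position of rows stay unchanged."""
    seen = set()
    result = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key in seen:
            result.append(None)  # blank placeholder keeps later rows in place
        else:
            seen.add(key)
            result.append(row)
    return result

# Made-up rows: the second entry repeats the first.
rows = [
    {"school": "A", "year": 2008, "value": 10},
    {"school": "A", "year": 2008, "value": 12},
    {"school": "B", "year": 2008, "value": 7},
]
print(blank_duplicates(rows))
```

Because the list length never changes, a metadata note that points at a row position still points at the same row after the duplicates are blanked.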

And as for the negative values...should we assume they are supposed to be 
positive, or discard them altogether?

The remaining highlighted items deal with outrageously high numbers. We should 
probably do a little background check to make sure they're acceptable.

-Adeline

Original comment by adelinem...@gmail.com on 27 Feb 2012 at 4:36

GoogleCodeExporter commented 9 years ago
Ya I'll fix the double entry and make sure not to delete the rows (sorry - lack 
of forward thinking on that one). Were there a lot of negative values for 
different variables? If it's a small portion, we can do some spot checking on 
them to see if they're clearly supposed to be positive or if it's too ambiguous 
to call and just discard them.
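One way to see whether the negatives are "a small portion" is to count them per column. This is only a sketch with made-up values; the function name is hypothetical and blanks are assumed to be stored as None.

```python
def negative_share(values):
    """Return (list of (index, value) for negatives, share of reported
    values that are negative). Blanks (None) are not counted."""
    present = [v for v in values if v is not None]
    negatives = [(i, v) for i, v in enumerate(values)
                 if v is not None and v < 0]
    share = len(negatives) / len(present) if present else 0.0
    return negatives, share

# Made-up column with two negative entries and one blank.
column = [10, -4, None, 7, -1, 3]
negatives, share = negative_share(column)
print(negatives)
print(share)
```

The returned indices give the entries to spot check by hand before deciding whether to flip the sign or discard them.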

Original comment by drcolm...@gmail.com on 28 Feb 2012 at 4:31

GoogleCodeExporter commented 9 years ago
Attached are my checks.  Just a few edits and some comments.

Original comment by icos.atr...@gmail.com on 2 Mar 2012 at 10:01

Attachments:

GoogleCodeExporter commented 9 years ago
So I finally have everyone's corrections. I've updated the meta data file, 
highlighting new issues that have arisen. I also un-highlighted problems that 
we've solved, specifically those pertaining to the data columns we decided not 
to keep. So again, I recommend you guys check out the meta data file to see if 
you can make any corrections.
-Adeline

Original comment by adelinem...@gmail.com on 2 Mar 2012 at 6:10

Attachments:

GoogleCodeExporter commented 9 years ago
The finished file is here:
http://unm-macroecology-2012.googlecode.com/files/ACUPCC-clean-finished-allcols.csv

Final metadata, including list of changes and suspect entries, is here:
http://unm-macroecology-2012.googlecode.com/files/FIN_acupcc_Meta.csv

Original comment by icos.atr...@gmail.com on 6 Mar 2012 at 6:03