google-code-export / sandy-disaster-recovery

Automatically exported from code.google.com/p/sandy-disaster-recovery
2 stars 2 forks source link

Import Work Orders Feature #133

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
There needs to be a method to import hundreds/ thousands of records using a 
combination of a standard, downloadable .csv or .xls/.xlsx file (which updates 
as the database schema is updated) and wizard to fix records with errors in 
addresses, Claimed organization, status, etc. The feature must be very fault 
tolerant if it is going to be available to all participants.

Alternatively, a less fault-tolerant import feature could be implemented if 
available ONLY to the administrator.  The system would rely upon the 
administrator to fix any logical errors in the combination of status and claim.

This is a very high priority.

Original issue reported on code.google.com by v...@aarontitus.net on 13 Jan 2013 at 9:56

GoogleCodeExporter commented 9 years ago
From an email:
Just to be clear, the requirements are:
Q1: You have a Excel/CSV file that contains data in a standard structure based 
on the database schema. This structure is not standardised yet, and after it is 
standardised, we should update it if we change the schema.

A1: Yes, but it is more likely that each incident will have a dynamic schema, 
so we shouldn’t wait for a standardized schema.  Instead, we should probably 
create a tool that allows you to download a template for the “current” 
schema, fill it out, and upload it.  Andy Gimma should be able to provide a 
little more detail about that.

If you upload this file to the server, then:
Q2: All the records in the file are added to the database if they pass 
validation; otherwise a wizard that walks the user through fixing each error is 
shown. Is this correct?

A2: Yes. That sounds correct.  Validation consists of: 1. Is it a valid address 
that Google Maps can locate? 2. If so, is it a duplicate? See Issue 83. If it 
fails validation, then give the user the opportunity to fix the addresses, or 
merge duplicates.

Original comment by v...@aarontitus.net on 25 Jan 2013 at 4:21

GoogleCodeExporter commented 9 years ago
Process:
1. Org End User clicks "Download Incident Template"
2. System queries the relevant Incident Assessment Form to get 
required/relevant fields.
3. System creates blank csv template based upon fields in #2.  Each field 
should contain a dummy example information, e.g. "John Smith, 123 Main St., 
Anytown, PA, 12345..." etc."
4. Org End User downloads blank template locally.
5. Org End User manually normalizes data he/she intends to import into the 
blank template.
6. Org End User sends the unvalidated CSV to Super Admin and/or Incident Admin.
   5.1 "Send" could mean email, upload, etc.
7. Admin manually validates info (e.g. visually checks to make sure that dates 
aren't entered into the wrong field).
8. Admin uploads CSV
9. System Validates CSV and handles exceptions
   9.1 Validations
       9.1.1 Address Validation: Use Google Maps API to validate address and calculate lat/long.
       9.1.2 Duplicate detection: Use algorithms developed for Issue 83.
       9.1.3 Data type validation: Is it a date/time Boolean, etc.  Note: "Yes" (case-insensitive) and "True" (case-insensitive) should all be interpreted as Boolean TRUE. "No" (case-insensitive), "False" (case-insensitive), and NULL should be interpreted as Boolean FALSE.
       9.1.4 Ignore Dummy example information (See #3 above)
   9.2 If a record passes validation, System enters the record as a new Work order.
   9.3 Exception Handling:
       9.3.1 If fail on a validation in 9.1, add column on extreme right with error message.
       9.3.2 Compile all failed records into a new csv file with error messages on extreme right column.  Send file back to Super Admin.

Original comment by v...@aarontitus.net on 26 Jan 2013 at 7:46

GoogleCodeExporter commented 9 years ago
I am currently looking at this (and planning for duplicate detection - issue 83 
- to come afterwards).

Original comment by cpw...@gmail.com on 7 Feb 2013 at 1:20

GoogleCodeExporter commented 9 years ago
Please see attached screenshot for a partially complete administrator tool for 
this.

In this approach, the Admin will have the option at the foot of the page to
(a) edit (replace) any erroring values before submission, or
(b) download an CSV file with the errors annotated,
(c) choose to submit only the valid entries and download the remaining invalid 
CSV.

This is a mix of #1 and #2 above - is this preferable, or should the form UI be 
dropped in favour of CSV input/output only?

Original comment by cpw...@gmail.com on 7 Feb 2013 at 6:28

Attachments:

GoogleCodeExporter commented 9 years ago
This approach is GREAT!
I'd like to see the edit feature, to make sure that it's user-friendly enough.

Original comment by v...@aarontitus.net on 7 Feb 2013 at 7:24

GoogleCodeExporter commented 9 years ago
To edit a row (prior to submission), the idea is to use the form controls as 
shown in the screenshot - controls appear for next to the fields with errors 
(in red).

Better than separate UI? This is in-place.

Original comment by cpw...@gmail.com on 7 Feb 2013 at 7:50

GoogleCodeExporter commented 9 years ago
Cool.  Not pretty yet... but it doesn't need to be!  Looks good. Excited to 
test.

Original comment by v...@aarontitus.net on 7 Feb 2013 at 9:40

GoogleCodeExporter commented 9 years ago
Also, we would probably only want to display results with errors; rather than 
all of them.

Original comment by v...@aarontitus.net on 7 Feb 2013 at 11:09

GoogleCodeExporter commented 9 years ago
Perhaps that should be an option when viewing the page? i.e. show all / show 
only errors

Showing all the fields allows manual validation by the Admin, meaning you 
wouldn't need to check the CSV before upload (as in #2 above).

Original comment by cpw...@gmail.com on 8 Feb 2013 at 10:40

GoogleCodeExporter commented 9 years ago
I have submitted a version for testing - r186 - and it is available on the 
testbed app.

It is missing the following features discussed above:
- edit (replace) any erroring values before submission
- only displaying results with errors

Also, it handles only the current model of Sites (i.e. without phases).

Original comment by cpw...@gmail.com on 25 Feb 2013 at 7:20

GoogleCodeExporter commented 9 years ago
It's available to the Global Admin only in the Admin interface.

Original comment by cpw...@gmail.com on 25 Feb 2013 at 7:22

GoogleCodeExporter commented 9 years ago
Attempted to upload a single  and failed. On upload, received the following 
message (in total):
https://sandy-helping-hands.appspot.com/admin-import-csv

<html><head>
<meta http-equiv="content-type" content="text/html;charset=utf-8">
<title>500 Server Error</title>
</head>
<body text=#000000 bgcolor=#ffffff>
<h1>Error: Server Error</h1>
<h2>The server encountered an error and could not complete your request.<p>If 
the problem persists, please <A 
HREF="http://code.google.com/appengine/community.html">report</A> your problem 
and mention this error message and the query that caused it.</h2>
<h2></h2>
</body></html>

I think we need a lot more exception handling. I know I was confused about how 
to fill it out.

1. Created Org, "MHH-Cherry Hill Stake"
2. Created Org, "NJ 2-1-1"
3. Uploaded attached file
4. 500 Error.

Original comment by v...@aarontitus.net on 25 Feb 2013 at 11:53

Attachments:

GoogleCodeExporter commented 9 years ago
Questions about fields:
* Is a Blank or NULL csv cell acceptable?
* For booleans, which of the following values are acceptable: TRUE, True, YES, 
Yes, yes, Y, y, FALSE, False, NO, No, no, N, n
* What date format(s) are acceptable?
* Until we modularize, will every question ever asked on any Incident show up 
on all import forms?
* Is it possible to start off by purging all of the fields that have never been 
used?

case_number: What happens if I don't leave this blank? E.g. what if I write 
"$295Mt" as the case number?
name: Are special characters allowed?
reported_by: Is a name, like "NJ 2-1-1" permissible?
claimed_by: Is a name, like "NJ 2-1-1" permissible?
request_date: Date format?
address:
city: 
county: What happens if I leave this blank?
state:
zip_code:
cross_street:
landmark:
phone1:
phone2:
time_to_call:
rent_or_own:
work_without_resident:
older_than_60:
disabled:
special_needs:
electricity:
standing_water:
tree_damage: I don't think this field is used. We should eliminate it.
tree_damage_details: Is this field used?  If not, we should eliminate it.
habitable:
work_requested:
others_help:
debris_removal_only:
work_type: Does "Trees or Wind" need to be spelled precisely?
flood_height:
floors_affected:
carpet_removal:
ceiling_removal:
debris_removal:
broken_glass
drywall_removal
heavy_item_removal
standing_water
pump_needed
num_trees_down: What happens if I put a non-integer here? e.g. "15-20"
num_wide_trees: Same question.
roof_damage
tarps_needed
goods_and_services
tree_diameter: Is this used? if not, then remove.
electrical_lines
cable_lines
cutting_cause_harm
other_hazards
insurance
notes
latitude: I assume I can leave this blank, right?
longitude
priority
inspected_by
status
assigned_to
date_closed
total_volunteers
hours_worked_per_volunteer
event: Can this be left blank, or at least be pre-filled in the template?
initials_of_resident_present
member_of_assessing_organization
first_responder
hardwood_floor_removal
mold_remediation
appliance_removal
do_not_work_before
prepared_by
status_notes
derechos_work_type: What is this, and can we remove "derechos" from the 
database?

Original comment by v...@aarontitus.net on 26 Feb 2013 at 12:09

GoogleCodeExporter commented 9 years ago
Re #12: the 500 error is due to a couple of unhandled exceptions - will be 
fixed.

Wrt confusion: Instructions need to be added to the first admin page. - will do 
following agreement on the below.

Re #13: general questions:
* Is a Blank or NULL csv cell acceptable?

Blank fields are.

* For booleans, which of the following values are acceptable: TRUE, True, YES, 
Yes, yes, Y, y, FALSE, False, NO, No, no, N, n

All of these (in the next revision), plus '0' and '1'.

* What date format(s) are acceptable?

Open question: Should be American only, or force to international/ISO (e.g. 
2012-02-26) ?

* Until we modularize, will every question ever asked on any Incident show up 
on all import forms?

Initially, yes, but this could be configured away.

* Is it possible to start off by purging all of the fields that have never been 
used?

Yes, I have added this as issue 204.

More to follow.

Original comment by cpw...@gmail.com on 26 Feb 2013 at 12:06

GoogleCodeExporter commented 9 years ago
* case_number: What happens if I don't leave this blank? E.g. what if I write 
"$295Mt" as the case number?

Should this only be assigned by the system?

* name: Are special characters allowed?

They should be.

* reported_by: Is a name, like "NJ 2-1-1" permissible?
* claimed_by: Is a name, like "NJ 2-1-1" permissible?

Only if it can be looked up in the Organization table - though this is not 
implemented yet.

* county: What happens if I leave this blank?

This is ok as long as the address geocodes as a whole (and the Google geocoder 
is very forgiving).

* tree_damage: I don't think this field is used. We should eliminate it.
* tree_damage_details: Is this field used?  If not, we should eliminate it.

Moved to issue 204.

* work_type: Does "Trees or Wind" need to be spelled precisely?

The allowed work_type values should be in the instructions - they currently 
appear in the error messages (see screenshot in this thread).

* num_trees_down: What happens if I put a non-integer here? e.g. "15-20"
* num_wide_trees: Same question.

I am not sure about this. The field needs to be an integer at present.

We could parse for this biggest integer given??

* tree_diameter: Is this used? if not, then remove.

Moved to issue 204.

* latitude: I assume I can leave this blank, right?
* longitude

These should probably be hidden.

* event: Can this be left blank, or at least be pre-filled in the template?

This should definitely be hidden.

* derechos_work_type: What is this, and can we remove "derechos" from the 
database?

Moved to issue 205.

Original comment by cpw...@gmail.com on 26 Feb 2013 at 12:16

GoogleCodeExporter commented 9 years ago
* case_number: What happens if I don't leave this blank? E.g. what if I write 
"$295Mt" as the case number?
Q: Should this only be assigned by the system?
A: Yes.  Should be hidden.

Any fields that are: A) not referenced in a particular Incident, or B) Assigned 
solely by the system during import should be hidden.
Category B) includes: 

case_number
latitude
longitude
(latitude_blur, when we add it)
(longitude_blur, when we add it)
event

Original comment by v...@aarontitus.net on 26 Feb 2013 at 2:05

GoogleCodeExporter commented 9 years ago
Agreed - the four existing fields you named will be excluded/hidden.

Original comment by cpw...@gmail.com on 26 Feb 2013 at 2:52

GoogleCodeExporter commented 9 years ago
Blurred field names also now excluded as in patch on issue 148 (r192).

Original comment by cpw...@gmail.com on 26 Feb 2013 at 5:11

GoogleCodeExporter commented 9 years ago
A new version (100) that deals will everything here is now available on the 
testbed.

Original comment by cpw...@gmail.com on 26 Feb 2013 at 5:25

GoogleCodeExporter commented 9 years ago
Still todo: determine an approximate maximum number of rows that can be 
processed this way (before GAE intervenes).

Original comment by cpw...@gmail.com on 26 Feb 2013 at 5:29

GoogleCodeExporter commented 9 years ago
There is also no duplicate detection (issue 83).

Original comment by cpw...@gmail.com on 26 Feb 2013 at 5:29

GoogleCodeExporter commented 9 years ago
Re #20: I have tested it on localhost with up to 500 valid rows - it took more 
than one minute, which will cause a request timeout on the appengine platform.

The main cost is bulk geocoding of addresses.

It will be possible to use the Task Queue and the 'deferred' library to run the 
geocoding and, secondly, saving in the background - but this will mean storing 
the file and having the Admin user wait for a background process to complete.

Original comment by cpw...@gmail.com on 26 Feb 2013 at 6:07

GoogleCodeExporter commented 9 years ago
Blocking on removal of Derechos special case handling.

Original comment by cpw...@gmail.com on 26 Feb 2013 at 7:11

GoogleCodeExporter commented 9 years ago
This seems to be working, except that I keep getting an error related to Issue 
21, Comment #27.  We'll have to fix that error and try again.

Original comment by v...@aarontitus.net on 27 Feb 2013 at 2:22

GoogleCodeExporter commented 9 years ago
Importing 500 work orders or more will be common.  I had anticipated that there 
would be a Task Queue, and the admin could see a list of batches "In process", 
and "Completed," then click on the results links once the imported batches are 
complete.

This would mean storing the imports somewhat permanently.  The Admin should 
have the option to "Delete Completed Imports," and "Stop and Delete All 
In-Process Imports."

Does that make sense?  Can you visualize this?

...and I know that duplicate detection is on the way.

Original comment by v...@aarontitus.net on 27 Feb 2013 at 2:28

GoogleCodeExporter commented 9 years ago
Agreed regarding the need to for CSV processing to run as background batch, 
with admin control available. 

In the absence of general detection of duplicates, it may be necessary to mark 
which rows of a CSV file have been accepted or not (yet).

Original comment by cpw...@gmail.com on 27 Feb 2013 at 10:41

GoogleCodeExporter commented 9 years ago
Just to be clear:
http://code.google.com/p/sandy-disaster-recovery/issues/detail?id=21#c27
is blocking implementation of this.  I have been unsuccessful at importing 
because organizations have not been properly associated with the Incident.

Original comment by v...@aarontitus.net on 1 Mar 2013 at 10:22

GoogleCodeExporter commented 9 years ago

Original comment by cpw...@gmail.com on 4 Mar 2013 at 11:44

GoogleCodeExporter commented 9 years ago
Error:
Imported the attached .csv to Derecho Storm event in sandbox. No errors 
reported during import. The results page read, "Geocode OK".  However:
1. No lat/lon was entered. lat/long both read "0.0".
2. The total number of work orders did not increment up. It still says "2 work 
orders," and should say "3 work orders."
3. No Work Order Number was entered. 
https://sandy-helping-hands.appspot.com/sites shows the imported work order, 
but the work order number reads, "None:"
4. No county entered. It looks like the interaction with the Google Maps API is 
just faulty.
All other information appears to have been entered correctly.

Original comment by v...@aarontitus.net on 5 Mar 2013 at 5:11

Attachments:

GoogleCodeExporter commented 9 years ago
I confirm all of these and am working on them now.

Please note: testing against Derechos will not work fully because of existing 
block against issue 205.

Original comment by cpw...@gmail.com on 5 Mar 2013 at 11:28

GoogleCodeExporter commented 9 years ago
I think that the work order total is a separate issue - it's not always updated 
in other cases. Added as new issue 213.

Original comment by cpw...@gmail.com on 5 Mar 2013 at 11:33

GoogleCodeExporter commented 9 years ago
Background importing and saving + fixes for #29 above released as r201, version 
104 on the testbed.

I am testing with large imports on the testbed now.

Original comment by cpw...@gmail.com on 5 Mar 2013 at 12:39

GoogleCodeExporter commented 9 years ago
100 row CSV file (of identical records): analyses and saves ok

500 row CSV file: causes memory error - "Exceeded soft private memory limit 
with 156.738 MB after servicing 5 requests total"

Original comment by cpw...@gmail.com on 5 Mar 2013 at 12:59

GoogleCodeExporter commented 9 years ago
Outstanding bugs:

1. Tracking down the memory leak has proved difficult. Could it be in the SDK..?

Until fixed, there is protection against failure: import tasks are not 
automatically retried by the platform (although it really wants to) and a 
failure is shown to the Admin in the table.

2. Exceeding the quota of the Google (Maps) geocoder causes rows to fail 
validation. At present, they cannot be retried.

Original comment by cpw...@gmail.com on 5 Mar 2013 at 6:14

GoogleCodeExporter commented 9 years ago
Some numbers again:

100 row CSV file (of identical records): analyses and saves ok

400 row CSV file: analyses and saves ok

500 row CSV file: causes memory error - "Exceeded soft private memory limit..."

Original comment by cpw...@gmail.com on 5 Mar 2013 at 6:15

GoogleCodeExporter commented 9 years ago
I've been testing for the past couple of days. Is there any way we can 
automatically do the following whenever I'm uploading a large batch (e.g. 1000 
people in batches of ~350):
* Allow me to be able to upload many .csv files at the same time. Right now, 
once the total number of records uploaded exceeds the magic number, the system 
freezes.
* Analyze each .csv sequentially.
* I'm not sure what the maximum number of Google Maps queries is (e.g. 250 per 
hour?)... but whenever we're about to reach that, the analysis should 
automatically pause.

...this way, I can upload a ton of .csv files, and forget about them for a day 
or two until they're done.

Original comment by v...@aarontitus.net on 7 Mar 2013 at 3:00

GoogleCodeExporter commented 9 years ago
I really like the import feature! Working well.

Original comment by v...@aarontitus.net on 7 Mar 2013 at 3:10

GoogleCodeExporter commented 9 years ago
Strange error on sandbox:

Rockaways_crisiscleanup.org_import.csv  Hurricane Sandy Recovery    Saving... 
    274     5   279     
Rockaways_crisiscleanup.org_import2.csv     Hurricane Sandy Recovery    Saving... 
    270     12  275     
Rockaways_crisiscleanup.org_import3.csv     Hurricane Sandy Recovery    Saved all 
valid rows  273     8   273

With the first two, all rows were "Saved," including the invalid rows, but for 
the third one, only valid rows were saved.

Original comment by v...@aarontitus.net on 7 Mar 2013 at 3:14

GoogleCodeExporter commented 9 years ago
Related error:
invalids1.csv   Hurricane Sandy Recovery    Error: Analysis failed - too large  6 
    0   0

This is a very small file, but has failed twice.  Reason unknown.

Original comment by v...@aarontitus.net on 7 Mar 2013 at 3:16

GoogleCodeExporter commented 9 years ago
...oh, and invalids1.csv is a collection of CORRECTED work orders that were 
invalid.

Original comment by v...@aarontitus.net on 7 Mar 2013 at 3:17

GoogleCodeExporter commented 9 years ago
I will look in to the errors in the individual files. If they are confidential 
and thus can't be posted here, please could you email them to me?

Original comment by cpw...@gmail.com on 7 Mar 2013 at 4:21

GoogleCodeExporter commented 9 years ago
Re handling batches:

I have looked at the Google Maps geocoding limits today.

Surprisingly, given that Google Maps now limits per IP address (rather than API 
key, as it used to), there is no special handling of GAE, meaning that GAE apps 
use common quota by virtue of being on shared IP addresses (without proxies in 
place).

Given this, and the memory issues, it may be better to rewrite/refactor the 
import code to work in a row-by-row fashion - i.e. don't do anything in bulk 
and do all analysis in single row chunks. This will support a more robust flow 
closer to #36.

Original comment by cpw...@gmail.com on 7 Mar 2013 at 4:26

GoogleCodeExporter commented 9 years ago
BUG:
There seemed to be an error in invalids1.csv. I'm not sure what it was, but I 
assumed it occurred on Row 7, since only 6 rows were processed... twice in a 
row. I placed Row 7 at the end, and re-submitted. It then failed on record 
25... and I figured out the error.  That particular record has no address.  So 
instead of a geocode error, it throws a "file too large" error.

BUG:
After successful import of valid rows into the system, they do not appear on 
the map, nor in the map search.

Original comment by v...@aarontitus.net on 7 Mar 2013 at 4:31

GoogleCodeExporter commented 9 years ago
Re the map bug, was this using to the Derechos or Hattiebsurg incidents on the 
testbed? I have noticed this happening some times - as meant by #30 above.

Original comment by cpw...@gmail.com on 7 Mar 2013 at 4:47

GoogleCodeExporter commented 9 years ago
POTENTIAL BUG:
The status of both "Rockaways_crisiscleanup.org_import.csv" and 
"Rockaways_crisiscleanup.org_import2.csv" are "Saving..." even though it's 
obvious they're done.  Are these two processes tying up system 
resources/costing money?

Original comment by v...@aarontitus.net on 7 Mar 2013 at 4:47

GoogleCodeExporter commented 9 years ago
Re #45: the current testbed Task Queue is empty, so it's not currently using 
any resources.

There are lots of failures present in the logs, some of which I have not seen 
before, so some of the processes will be failing and not updating the state 
correctly as it stands.

Original comment by cpw...@gmail.com on 7 Mar 2013 at 4:53

GoogleCodeExporter commented 9 years ago
Note: Relevant from the logs:
admin_handler/admin_import_csv_handler.py", line 282, in validate_row
    del(validation_row_copy[field_name])
KeyError: 'address'

Original comment by cpw...@gmail.com on 7 Mar 2013 at 4:54

GoogleCodeExporter commented 9 years ago
Note: lots of transaction collisions - unknown cause (atm).

Original comment by cpw...@gmail.com on 7 Mar 2013 at 4:55

GoogleCodeExporter commented 9 years ago
re #44: I see not - all tests are on the Sandy and Test incidents.

Original comment by cpw...@gmail.com on 7 Mar 2013 at 4:55

GoogleCodeExporter commented 9 years ago
FYI- we're going to have a data dump of ~800 work orders from Georgia that may 
come in as early as tomorrow (Friday).

Original comment by v...@aarontitus.net on 8 Mar 2013 at 2:50