opendatabio / opendatabio-r

R package for interacting with OpenDataBio
https://opendatabio.gitlab.io/
GNU General Public License v3.0
3 stars 2 forks source link

Location imports - parent recognition and location search #17

Closed betovicentini closed 6 years ago

betovicentini commented 6 years ago

odb_import_locations is not recognizing parent during importation of different admin_levels of countries. I first imported level 0 for Bolivia, then batch imported all levels 1 for the country. Bolivia was not recognized as parent for level 1 admin units. Therefore, I must inform parent during importation of level 1. This then should work (I will write the import script considering this), but is not ideal (OBS: this is not possible as import script does not recognize parent_id field in data frame).

Location importation should consider:

betovicentini commented 6 years ago

Also odb_get_locations does not recognize 'adm_level=0' in the parameter list. This implies that during batch import of a country and subunits, it makes it difficult to check directly if the data is already imported, one must use the levelName instaed, which is not what comes with the GDAM. Results should include Parent information (name and id) for any given locality.

When I search:

pr = odb_params(list(search="Colombia",adm_level=0))
tt = odb_get_locations(pr,cfg)

#get:
 id     name    levelName
1 5503 Colômbia Municipality
2 7398 Colombia      Country
betovicentini commented 6 years ago

Also, to avoid duplicate insertion, while doing a hierarchical upload from level 0 upwards, it would be important to be able to search for a given location name given a particular PARENT_ID. However, the odb_get_location function does not accept parent_id=id in the parameter list. This needs to be implemented, so searches can be easily filtered. Given that parent_id is not returned by the function, it is also impossible to check this a posteriori.

betovicentini commented 6 years ago

Moreover: The odb_get_location search options should allow variations of the kind:

search='%colombia%'
search="colombia" only full matches
search="colombia%"

This could either be implemented as an additional parameter to the function, or perhaps preferably, the construct customized by the user. The user could enter SQL statments? like: LIKE %colombia or variations, even NOT LIKE %colombia. We need flexibility but also easiness, so if a sql statment is written, then perhaps a parameter sqlwhere, would give alternative possibilities of finding records.

andrechalom commented 6 years ago
andrechalom commented 6 years ago

Update:

> odb_get_locations(list(search="Assis", fields=c("name", "parent_id", "fullname")), cfg)
andrechalom commented 6 years ago

Finally, "The user could enter SQL statments? like: LIKE %colombia or variations, even NOT LIKE %colombia." The answer is: that's very complicated to do while preserving some measure of security.

betovicentini commented 6 years ago

odb_import_locations is not recognizing parent during importation - Fixed in the latest development version, please test.

Tested this and had mixed results.

So it seems that parent detection is not working properly every time. The only thing I may have noticed is that if the importation process is ongoing and I look at the location table in the interface, then it seems if fails parent recognition, but if I wait to look at results after all levels are imported, then parents were recognized. Not sure this is the reason.

betovicentini commented 6 years ago

In your list above there is also something you forgot:

andrechalom commented 6 years ago

This is already inplemented in the current code, can you test it? [EDIT: tested, it is working]

andrechalom commented 6 years ago

Notice that the dataframe column is "parent", not parent_id

On Fri, Mar 23, 2018, 12:24 André Chalom andrechalom@gmail.com wrote:

This is already inplemented in the current code, can you test it?

On Fri, Mar 23, 2018, 11:59 Alberto Vicentini notifications@github.com wrote:

In your list above there is also something you forgot:

  • odb_import_locations should be able to use a parent_id value if that is provided, in which case parent detection is not needed. This would allow better control during importation, allowing users to place a location under a parent.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/opendatabio/opendatabio-r/issues/17#issuecomment-375692979, or mute the thread https://github.com/notifications/unsubscribe-auth/AB5w9cc1T_1NuiofNPhGbeCq5h_RmOzZks5thQ3PgaJpZM4S2-nJ .

betovicentini commented 6 years ago

I tried importing the following data.frame, states for Bolivia:

head(toimport[,-5]) #excluded geom
     name adm_level parent datum
1 Chuquisaca         1      2 WGS84
2 Cochabamba         1      2 WGS84
3    El Beni         1      2 WGS84
4     La Paz         1      2 WGS84
5      Oruro         1      2 WGS84
6      Pando         1      2 WGS84

Parent was not assigned and parent_id remained NULL in the mysql locations table.

Moreover, I ran odb_import_locations(toimport, cfg) again, and these records were duplicated (#14 not solved), but this second time, parent was assigned correctly.

To make sure I've done nothing wrong in the first place. I repeated the test above, step by step, and got similar results, only in this case the second try to import level 1 admin also did not recognized parent. Again records were duplicated.

So it seems that both parent detection and parent informed are not working properly. At least in the second, parent should be correctly linked.

betovicentini commented 6 years ago

Ops, on previous comment, my error (1 line in code screwing it up, informing parent_id=1 for country level). So parent informed importation seems to be working correctly and if you try to import again, importation fails, and I get the warning in the job WARNING: location Chuquisaca already imported to database.. so, this option is working ok. My mistake.

betovicentini commented 6 years ago

parent detection, also seems to work, but parent is not necessarily identified. One problem that I've noted is that I am using getData('GADM', country=crt$Codes[t], level=lv) to get the data to import and this comes with a parent column, which is a factor with parent names. Could this be a problem?Removing this column from the data.frame made the function "partially" work. I got parent detected, but for Colombia the parent was not correctly detected: Municipalities (level 2) were assigned to only two States (level 1), while in the original data all states have municipalities. Spatial detection config?

betovicentini commented 6 years ago

So it seems that both parent detection and parent informed are not working properly. At least in the second, parent should be correctly linked.

I tested again the "parent informed" script, first with Bolivia (ok, worked), then with Colombia ( parent not assigned for level 2, only for level 1), but they were informed. Tested again twice, using the same script and parent information. The problem persist. Even when parent is informed, it is not assigned by the job.

Here is the script used:

cfg = odb_config(base_url=base_url, token = token)
countries = c("BOL", "BRA", "COL", "CRI", "ECU", "GUF", "GUY", "NIC", "PAN", "PER", "SUR", "VEN")
for(t in c(1,3)) {
  #define o comeco
  lv = 0
  ff = 0
  parent_id = 0
  parid = 0
  while(ff==0) {
    importa=FALSE
    ocrt <- try(getData('GADM', country=countries[t], level=lv),silent = T)
    if (class(ocrt)=="try-error") {
      #neste caso acabou para o pais, interrompe o loop
      ff=1
      break
    }
    #pega as localidades a serem cadastradas
    toimport = sp_to_df(ocrt)
    if (lv==0) {
      toimport$parent = parent_id
    }
    else {
      #portanto parid ja deve existir e deveria ter o mesmo que os valores da coluna lv-1
      cln2 = paste("NAME",(lv-1),sep="_")
      parnames = ocrt[[cln2]]
      parent_id = parid[parnames]
      #prepara data.frame para importar
      toimport$parent = parent_id
    }
    #toimport$parent_id
    op = toimport$parent
    print(paste("importing",nrow(toimport),"records for country",countries[t],"admin level",lv))
    print(paste("parent is ",class(op)," with ",sum(is.na(op)),"NA values and ",length(unique(op[!is.na(op)])),"unique parent values"))
    odb_import_locations(toimport, cfg)
    #dorme um pouco para garantir que o servidor processe a informacao
    Sys.sleep(60) #needs to be large when back importing many things

    #pega o ultimo job id
    jobids = odb_get_jobs(list(),cfg)
    jobids = lapply(jobids,unlist) #reformata output para data.frame normal
    jobid = jobids$id[order(jobids$id,decreasing = T)][1]
    #log do job
    idt = odb_get_log(jobid,odb_cfg = cfg)
    parid = odb_get_affected_ids(jobid, odb_cfg = cfg)
    if (!nrow(toimport)==length(parid)) {
      print(paste("Dos",nrow(toimport),"apenas", length(parid),"foram importados"))
      break
    } else {
      names(parid) = as.vector(toimport$name,mode='character')
    }
    Sys.sleep(10)
    lv=lv+1
  } #end while
} #end for each country
betovicentini commented 6 years ago

I've checked the import function in the app folder and noted the following line:

 // forces null if parent / uc was explicitly passed as zero
 'parent_id' => (0 == $parent ? null : $parent),

However, theres is a previous line:

if (0 == $adm_level) {
            $world = Location::world();
            $parent = $world->id;
        }

Could be that when adm_level==0 then $parent becomes 0 because world has 0 id and then it is set to null for the country, which then might be causing the problem. Otherwise, the script seems ok to me. but it is not working still.

andrechalom commented 6 years ago

@betovicentini, I am unable to reproduce the errors you mention in this thread, because the script you provided does not run at all. But the intermitent errors suggest that the problem is related to some cache feature. Are you sure that:

Moreover, it's very difficult to follow what your script is doing, because it has a lot of code that's not related to the present error, and all of your variable names are meaningless (what does ff mean? What does op mean?) Please try to provide minimal working scripts for the bug reports, with more informative variable names and function names.

On Fri, Mar 23, 2018, 15:07 Alberto Vicentini notifications@github.com wrote:

I tried importing the following data.frame, states for Bolivia:

head(toimport[,-5]) #excluded geom name adm_level parent datum 1 Chuquisaca 1 2 WGS84 2 Cochabamba 1 2 WGS84 3 El Beni 1 2 WGS84 4 La Paz 1 2 WGS84 5 Oruro 1 2 WGS84 6 Pando 1 2 WGS84

Parent was not assigned and parent_id remained NULL in the mysql locations table.

Moreover, I ran odb_import_locations(toimport, cfg) again, and these records were duplicated (#14 https://github.com/opendatabio/opendatabio-r/issues/14 not solved), but this second time, parent was assigned correctly.

To make sure I've done nothing wrong in the first place. I repeated the test above, step by step, and got similar results, only in this case the second try to import level 1 admin also did not recognized parent. Again records were duplicated.

So it seems that both parent detection and parent informed are not working properly. At least in the second, parent should be correctly linked.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/opendatabio/opendatabio-r/issues/17#issuecomment-375753737, or mute the thread https://github.com/notifications/unsubscribe-auth/AB5w9aNQ5bmJWKE99-p3fJpE1t8yTRDXks5thTnFgaJpZM4S2-nJ .

andrechalom commented 6 years ago

Also, the world object cannot have id zero, as it's either created with id 1 in the migration process, or imported with id greater than 8000 during the database seed. But you can check it on the MySql prompt with SELECT id FROM locations WHERE name ='World';

On Sat, Mar 24, 2018, 12:14 André Chalom andrechalom@gmail.com wrote:

@betovicentini, I am unable to reproduce the errors you mention in this thread, because the script you provided does not run at all. But the intermitent errors suggest that the problem is related to some cache feature. Are you sure that:

  • you have the latest development version of the code?
  • you have run php install to clear the php caches?
  • you have restarted supervisor to clear the job caches?

Moreover, it's very difficult to follow what your script is doing, because it has a lot of code that's not related to the present error, and all of your variable names are meaningless (what does ff mean? What does op mean?) Please try to provide minimal working scripts for the bug reports, with more informative variable names and function names.

On Fri, Mar 23, 2018, 15:07 Alberto Vicentini notifications@github.com wrote:

I tried importing the following data.frame, states for Bolivia:

head(toimport[,-5]) #excluded geom name adm_level parent datum 1 Chuquisaca 1 2 WGS84 2 Cochabamba 1 2 WGS84 3 El Beni 1 2 WGS84 4 La Paz 1 2 WGS84 5 Oruro 1 2 WGS84 6 Pando 1 2 WGS84

Parent was not assigned and parent_id remained NULL in the mysql locations table.

Moreover, I ran odb_import_locations(toimport, cfg) again, and these records were duplicated (#14 https://github.com/opendatabio/opendatabio-r/issues/14 not solved), but this second time, parent was assigned correctly.

To make sure I've done nothing wrong in the first place. I repeated the test above, step by step, and got similar results, only in this case the second try to import level 1 admin also did not recognized parent. Again records were duplicated.

So it seems that both parent detection and parent informed are not working properly. At least in the second, parent should be correctly linked.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/opendatabio/opendatabio-r/issues/17#issuecomment-375753737, or mute the thread https://github.com/notifications/unsubscribe-auth/AB5w9aNQ5bmJWKE99-p3fJpE1t8yTRDXks5thTnFgaJpZM4S2-nJ .

andrechalom commented 6 years ago

OK, I was finally able to run this code. The problem is that "try(getData('GADM', country=countries[t], level=lv),silent = T)" was failing silently because I hadn't loaded the raster library. So, I have some suggestions for the code:

  1. I'm running locally

    ocrt <- raster::getData('GADM', country="BOL", level=0)
    odb_import_locations(sp_to_df(ocrt), cfg)

    This works with no more duplication for countries.

  2. When importing level 1 deparments, I run

    ocrt <- raster::getData('GADM', country="BOL", level=1)
    odb_import_locations(sp_to_df(ocrt), cfg)

    It works with no problems, and if I try to run it again, all departments are identified as "already imported to database". Notice that sp_to_df is a function defined in the "opendatabio" package, which assigns parents to each location.

  3. Running the code with no parent informed, I removed the department data for Bolivia, then run:

    ocrt <- raster::getData('GADM', country="BOL", level=1)
    my.sp = sp_to_df(ocrt)
    odb_import_locations(my.sp[,-3], cfg) # which removes the "parent" column

    This works, and all departments are correctly assigned under Bolivia.

Closing the issue as the code is working as intended. Please try to clear caches as described above, as this is most certainly the cause of all of the mixed results.

andrechalom commented 6 years ago

Finally, instead of "sleeping" for a minute, I recomend using the assigned job id and checking the progress == 100%. You can get the job object with

> job = odb_import_locations(sp_to_df(ocrt), cfg)
> job$id
[1] 36