jamesshocking / Spark-REST-API-UDF

Example of how to leverage Apache Spark distributed capabilities to call REST-API using a UDF
MIT License

Running in Databricks, results are empty. #3

Open TomBigDataWilson opened 2 years ago

TomBigDataWilson commented 2 years ago

Hello James:

First, thank you for doing this, this is an excellent example. When I run this, the results_df has 4 nulls in the execute/result column. When I followed the readme, that example names the result structure as result. The code names it execute. No biggie, just explaining why I have two different names for the same column.

This is the request_df (before the udf is executed).

+----+---------------------------------------------------------------+----------------------------------+----+
|verb|url                                                            |headers                           |body|
+----+---------------------------------------------------------------+----------------------------------+----+
|get |https://vpic.nhtsa.dot.gov/api/vehicles/getallmakes?format=json|{content-type -> application/json}|{}  |
+----+---------------------------------------------------------------+----------------------------------+----+

It takes about 10 seconds to run the udf, and this is the result.

+----+---------------------------------------------------------------+----------------------------------+----+------------------------+
|verb|url                                                            |headers                           |body|result                  |
+----+---------------------------------------------------------------+----------------------------------+----+------------------------+
|get |https://vpic.nhtsa.dot.gov/api/vehicles/getallmakes?format=json|{content-type -> application/json}|{}  |{null, null, null, null}|
+----+---------------------------------------------------------------+----------------------------------+----+------------------------+

Based on the {null, null, null, null} in the result/execute column, it appears that the udf is not getting the right parameters for the call, but based on the run time, the udf does seem to be doing work of some type. Any thoughts or suggestions?

Thank you!

jamesshocking commented 2 years ago

Hi Tom

My apologies for the late response. It turns out that the US government API that I used in the sample code is blocking any HTTP request, and so when the UDF is executed, an exception is being thrown when the code executes the HTTP request.
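A failure like this is easier to spot when the function behind the UDF reports what went wrong instead of yielding nulls. The following is a minimal sketch, not the repository's code: `call_rest_api`, the `http_get` parameter, and the `status`/`error`/`body` field names are all hypothetical, and the UDF's return schema would need matching fields.

```python
import json
from typing import Any, Callable, Dict, Optional


def call_rest_api(url: str, headers: Dict[str, str],
                  http_get: Optional[Callable[..., Any]] = None) -> Dict[str, Any]:
    """Call `url` and always return a dict describing what happened.

    Hypothetical helper: the field names are illustrative only.
    """
    if http_get is None:
        # Imported lazily so a stub can be passed in when testing offline.
        import requests
        http_get = requests.get
    try:
        response = http_get(url, headers=headers, timeout=30)
    except Exception as exc:
        # Connection refused, timeout, DNS failure, and so on.
        return {"status": None, "error": f"{type(exc).__name__}: {exc}", "body": None}
    if response.status_code != 200:
        # Keep a short excerpt of the response so the reason is visible.
        return {"status": response.status_code, "error": response.text[:200], "body": None}
    return {"status": 200, "error": None, "body": json.loads(response.text)}
```

With something along these lines, a refused connection would show up as a populated `error` value in the result column rather than as a row of nulls.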

Try switching to a different REST service and all should be okay.

Thanks for letting me know. I will update the documentation to use a different API endpoint.

Kind regards James

TomBigDataWilson commented 2 years ago

Hi James, Thank you for your fast response. I'll try a different service and see what happens!

I'm not 100% on the cause of the problem, as I was able to paste the api call into a web-browser and got data that fits the struct-type in the solution. Maybe it is something specific with Requests or something else. But, I will definitely use this as an example for my project, and proceed with my project-related api call.

Thanks much! Thomas


jamesshocking commented 2 years ago

I believe that the issue relates to the HTTP USER_AGENT that the Requests object uses when making HTTP requests (https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent). I believe that the US Government web service is using the USER_AGENT at the network level to refuse connections that identify themselves in a certain way.

When you load the Service endpoint using MS Edge Mobile for example, the browser will use the USER_AGENT

Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N)
AppleWebKit/537.36 (KHTML, like Gecko)
Chrome/90.0.4430.85
Mobile Safari/537.36
EdgA/90.0.818.46

whereas the Python requests library does not send a browser string; by default it identifies itself as python-requests/<version>. When you load the service directly in your desktop web browser, the service works as expected. In contrast, the service denies the HTTP request when the USER_AGENT isn't something that it wants to support. It would be simple for the web service to refuse all requests where the USER_AGENT had not been set to a browser value.

I haven't tested this idea, but it would explain why one environment works and the other does not. It wouldn't take much to test it though.

In the code, I instantiate an empty dictionary called header. Change this instantiation to

header = { 'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Mobile Safari/537.36 EdgA/90.0.818.46' }

Then try the request again and see if it works. You can also call the requests function directly, outside of the UDF. That will tell you quickly whether the remote service accepts the request.
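A quick way to run that check outside of Spark is sketched below. It is untested against the live service: `build_headers`, `probe`, and the `http_get` parameter are hypothetical names, and the content-type header is taken from the request_df shown earlier in the thread.

```python
from typing import Any, Callable, Dict, Optional

URL = "https://vpic.nhtsa.dot.gov/api/vehicles/getallmakes?format=json"
BROWSER_UA = ("Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) "
              "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 "
              "Mobile Safari/537.36 EdgA/90.0.818.46")


def build_headers(user_agent: Optional[str] = None) -> Dict[str, str]:
    """Return the request headers, adding User-Agent only when one is given."""
    headers = {"content-type": "application/json"}
    if user_agent:
        headers["User-Agent"] = user_agent
    return headers


def probe(url: str, user_agent: Optional[str] = None,
          http_get: Optional[Callable[..., Any]] = None) -> str:
    """Make one GET request and summarise the outcome as a short string."""
    if http_get is None:
        # Imported lazily so a stub can be passed in when testing offline.
        import requests
        http_get = requests.get
    try:
        response = http_get(url, headers=build_headers(user_agent), timeout=30)
    except Exception as exc:
        return f"failed: {type(exc).__name__}"
    return f"HTTP {response.status_code}"
```

Running `probe(URL)` and then `probe(URL, BROWSER_UA)` in a notebook cell and comparing the two strings would show whether the User-Agent makes the difference.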