popbr / data-integration

Web Scraping issue #13

Open MNSleeper opened 1 year ago

MNSleeper commented 1 year ago

@aubertc, I have a bit of an issue with the web-scraping aspect of things. I was trying to get a mini-implementation going, but couldn't get FileUtils to be recognized. That's a problem I ran into before, and you sent a minimal app that I was able to get working as a fix. I redownloaded it and fired it up, but it didn't work. The error I get is:

    Exception in thread "main" java.lang.Error: Unresolved compilation problem:
        FileUtils cannot be resolved
        at com.mycompany.app.App.main(App.java:23)

I don't understand why it just stopped working. I know I was able to get a scraped HTML page before, but now it won't even compile. I know this might be old ground, but do you have any solutions or ideas as to why this is happening?

aubertc commented 1 year ago

> I was trying to get a mini-implementation

Please share the code, I can't do much without looking at it.

aubertc commented 1 year ago

I guess you mean this one?

package com.mycompany.app;
import org.apache.commons.io.FileUtils;
import java.io.File;
import java.io.IOException;
import java.net.URL;

public class App {
    public static void main(String[] args) {
        try {
            URL url = new URL("https://www.nsf.gov/awardsearch/download.jsp"); // don't just assign the string to the url: create a URL object with it.
            String Fname = "example.html"; // Give some extension to your file (here, I added ".html").
            File destination = new File(Fname); // You could avoid creating a string, but ok.
            FileUtils.copyURLToFile(url, destination);
        } catch (IOException e) { // You were not catching exceptions, which is weird to me.
            e.printStackTrace();
        }
    }
}

MNSleeper commented 1 year ago

My apologies. Yes, that was the implementation I tried to execute but failed.

aubertc commented 1 year ago

That may be because Maven is not in charge of finding that library, so your "basic" Java installation can't find it.
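One way to probe that hypothesis is to ask the JVM whether it can see the class at all. The sketch below is only an illustration: `ClasspathCheck` and `isOnClasspath` are hypothetical names, and it assumes you launch it the same way you launched the failing App. It reports on the runtime classpath only, so it cannot by itself explain a compile-time failure.

```java
public class ClasspathCheck {
    /** Returns true if the named class can be located on the current classpath. */
    public static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Show what the JVM was actually given as its classpath.
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));
        System.out.println("FileUtils found: "
                + isOnClasspath("org.apache.commons.io.FileUtils"));
    }
}
```

If this reports that FileUtils is not found, the commons-io jar is not reaching the JVM, which would be consistent with the library not being managed by Maven.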

To use Maven to create a small test, do:

  1. mvn archetype:generate -DgroupId=com.popbr.fileUtilsTest -DartifactId=fileUtilsTest -DarchetypeArtifactId=maven-archetype-quickstart -DarchetypeVersion=1.4 -DinteractiveMode=false
  2. In the pom.xml file, add

         <dependency>
             <groupId>commons-io</groupId>
             <artifactId>commons-io</artifactId>
             <version>2.11.0</version>
         </dependency>

     in the <dependencies> group,

  3. In fileUtilsTest/src/main/java/com/popbr/fileUtilsTest/App.java, copy-and-paste the code

         package com.popbr.fileUtilsTest;
         import org.apache.commons.io.FileUtils;
         import java.io.File;
         import java.io.IOException;
         import java.net.URL;

         public class App {
             public static void main(String[] args) {
                 try {
                     URL url = new URL("https://www.nsf.gov/awardsearch/download.jsp"); // don't just assign the string to the url: create a URL object with it.
                     String Fname = "example.html"; // Give some extension to your file (here, I added ".html").
                     File destination = new File(Fname); // You could avoid creating a string, but ok.
                     FileUtils.copyURLToFile(url, destination);
                 } catch (IOException e) { // You were not catching exceptions, which is weird to me.
                     e.printStackTrace();
                 }
             }
         }

  4. Finally, run `mvn compile` and `mvn exec:java -Dexec.mainClass="com.popbr.fileUtilsTest.App"`.

aubertc commented 1 year ago

(I assume that by "fire it up", you meant "using command-line java")
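If the test has to run with command-line java and no Maven, one alternative is to avoid the extra library altogether. The sketch below swaps in a different technique from the one discussed in this thread: it uses only JDK classes (`URL.openStream` and `Files.copy`) instead of `FileUtils.copyURLToFile`. The class name `JdkDownload` and the method `download` are hypothetical, and it assumes Java 11 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class JdkDownload {
    /** Copies the content behind url into destination, replacing any existing file. */
    public static long download(URL url, Path destination) throws IOException {
        try (InputStream in = url.openStream()) {
            // Files.copy returns the number of bytes written.
            return Files.copy(in, destination, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // Same URL and file name as in the App.java example from this thread.
        URL url = URI.create("https://www.nsf.gov/awardsearch/download.jsp").toURL();
        long bytes = download(url, Path.of("example.html"));
        System.out.println("Wrote " + bytes + " bytes to example.html");
    }
}
```

This compiles with a plain `javac JdkDownload.java` because nothing outside the JDK is imported. It does not set connection or read timeouts, so it is a test aid rather than a replacement for the Maven setup.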

aubertc commented 1 year ago

I had a look at https://github.com/popbr/data-integration/tree/main/fileUtilsTest and have a few comments:

  1. The naming gives the command mvn exec:java -Dexec.mainClass="com.popbr.fileUtilsTest.App", which is not very consistent with your previous naming schemes.
  2. I obtained the error java.lang.NullPointerException: Cannot read the array length because "fileN" is null. I believe I need to understand why you declare https://github.com/popbr/data-integration/blob/d9530072db5d28dac844666153ac19baebe51e80/fileUtilsTest/src/main/java/com/popbr/fileUtilsTest/App.java#L18 before overriding it with "hard-coded" values at https://github.com/popbr/data-integration/blob/d9530072db5d28dac844666153ac19baebe51e80/fileUtilsTest/src/main/java/com/popbr/fileUtilsTest/App.java#L19 Can't we do something more modular?
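A possible shape for the more modular version is sketched below. It rests on an assumption, since the linked code is not reproduced here: that "fileN" is filled from `File.list()` or `File.listFiles()`, both of which return null when the path is not a readable directory. The names `ListDownloads` and `fileNames` and the default folder "downloads" are hypothetical.

```java
import java.io.File;
import java.util.List;

public class ListDownloads {
    /**
     * Returns the names of the entries in dir, or an empty list when dir
     * does not exist, is not a directory, or cannot be read.
     */
    public static List<String> fileNames(File dir) {
        String[] names = dir.list(); // null if dir is not a readable directory
        return names == null ? List.of() : List.of(names);
    }

    public static void main(String[] args) {
        // The folder is a parameter instead of a hard-coded value;
        // "downloads" is only a placeholder default.
        File dir = new File(args.length > 0 ? args[0] : "downloads");
        for (String name : fileNames(dir)) {
            System.out.println(name);
        }
    }
}
```

Passing the folder in as an argument keeps the test values out of the source, and the null check turns the NullPointerException into an empty result that the caller can handle.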

MNSleeper commented 1 year ago

  1. Regarding the naming:

mvn archetype:generate -DgroupId=com.popbr.fileUtilsTest -DartifactId=fileUtilsTest -DarchetypeArtifactId=maven-archetype-quickstart -DarchetypeVersion=1.4 -DinteractiveMode=false

was the command I used, which you suggested in the reply here https://github.com/popbr/data-integration/issues/13#issuecomment-1289288750. I do intend to integrate this program's functions with the larger program once its bugs/functions are worked out and it works fully.

  2. I apologize for the lack of modularity; I wasn't thinking about it when I tested this. I hard-coded that value for testing when files were being written to odd places. I can fix it later tonight.

MNSleeper commented 1 year ago

The hardcoding problem has been removed. As a bonus, the unzipped files now go directly into the downloads folder, not some extra folder in the downloads folder.
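For readers following along, here is a minimal sketch of one way to extract an archive straight into a target folder without an intermediate folder. It is not the code from the repository: `FlatUnzip` and `unzipFlat` are hypothetical names, and it assumes that dropping the folder part of each entry name is acceptable, so files with the same name in different folders overwrite each other.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class FlatUnzip {
    /**
     * Extracts every regular file in zipFile directly into targetDir,
     * ignoring any folder prefix in the entry names.
     * Returns the number of files written.
     */
    public static int unzipFlat(Path zipFile, Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        int count = 0;
        try (ZipInputStream zin = new ZipInputStream(Files.newInputStream(zipFile))) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // folders are not recreated
                }
                // Keep only the last segment of the entry name (zip uses '/').
                String name = entry.getName();
                String fileName = name.substring(name.lastIndexOf('/') + 1);
                if (fileName.isEmpty() || fileName.equals(".") || fileName.equals("..")) {
                    continue; // skip names that would not denote a file in targetDir
                }
                // Reads up to the end of the current entry only.
                Files.copy(zin, targetDir.resolve(fileName), StandardCopyOption.REPLACE_EXISTING);
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java FlatUnzip archive.zip downloads
        int n = unzipFlat(Path.of(args[0]), Path.of(args[1]));
        System.out.println("Extracted " + n + " file(s)");
    }
}
```

Keeping only the last segment of each entry name also means an entry cannot be written outside the target folder.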