msarnacki / flashscore-scraper

Scraping football matches details for a whole, given season. First web scraping script. Made to learn and for fun.

using xlsx file for urls #2

Closed aidotours closed 4 years ago

aidotours commented 4 years ago

Hi Maciej,

Thanks for the great work. I have only recently started studying python and am not yet an expert.

As a massive football fan I was trying to get some team lineups to match players and found your repository which is awesome. It has so much potential and has now introduced me to Selenium and its wonders as well.

However I seem to be having difficulty running main.py.

I get the following traceback:

```
Traceback (most recent call last):
  File "main.py", line 174, in <module>
    urls = get_urls_xlsx(urls_path, data_path)
  File "main.py", line 98, in get_urls_xlsx
    file_names = os.listdir(path_data)
FileNotFoundError: [Errno 2] No such file or directory: 'data'
```

I don't understand what data should be pointing to. What more do you need than the xlsx file with the urls?

As I am still learning all help would be appreciated.

Kind regards Adrian

msarnacki commented 4 years ago

Hey Adrian!

Thank you for your interest in my project.

> ```
> File "main.py", line 174, in <module>
>     urls = get_urls_xlsx(urls_path, data_path)
> File "main.py", line 98, in get_urls_xlsx
>     file_names = os.listdir(path_data)
> FileNotFoundError: [Errno 2] No such file or directory: 'data'
> ```
> I don't understand what data should be pointing to. What more do you need than the xlsx file with the urls?

data_path = 'data'

'data' is the value of the data_path variable. This variable is passed to the get_urls_xlsx function, as you can see below. The function takes two arguments: path_urls (the path to the xlsx file with the URLs) and path_data (the path to the folder with the already-scraped .xlsx files).

On this remote repo I don't have the 'data' folder with all the .xlsx files because I don't want to share the scraped data.

urls = get_urls_xlsx(urls_path, data_path)

```python
import os
import pandas as pd

def get_urls_xlsx(path_urls, path_data):
    # Read the list of season URLs from the xlsx file.
    df = pd.read_excel(path_urls, usecols=['URL'])
    urls1 = df['URL'].tolist()

    # League names already scraped, taken from the file names in the data folder.
    file_names = os.listdir(path_data)
    files_league = [f.split('.')[0] for f in file_names]

    # Keep only URLs whose league has not been scraped yet.
    urls2 = []
    for url in urls1:
        url_league = url.split('/')[-3]
        if (url_league not in files_league) and (url_league != ''):
            urls2.append(url)

    return urls2
```
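To make the filtering step concrete, here is a small illustration of how the league name is pulled out of a URL. The example URL below is an assumption based on the Flashscore season-results pattern, not taken from your file:

```python
# Hypothetical Flashscore-style season URL (trailing slash matters:
# split('/') then produces an empty last element, so [-3] is the league).
url = "https://www.flashscore.com/football/england/premier-league/results/"

url_league = url.split('/')[-3]
print(url_league)  # -> premier-league
```

So if a file named `premier-league.xlsx` already exists in the data folder, that URL is skipped.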

If you want to run the code without an error, you should create a folder named 'data' in the directory where you have this script. Let me know if this works for you.
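You can also create the folder from Python instead of doing it by hand. This is just a minimal sketch, assuming the script is run from its own directory:

```python
import os

# Create the 'data' folder next to the script if it is missing, so that
# os.listdir('data') no longer raises FileNotFoundError.
# exist_ok=True makes this safe to run repeatedly.
os.makedirs('data', exist_ok=True)
```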

Kind regards Maciej

aidotours commented 4 years ago

Hi Maciej,

I got it. I should have understood the first time; the info is in the README, but I think I missed it because my understanding is still not great.

Thanks for the help.