petermr / openVirus

aggregation of scholarly publications and extracted knowledge on viruses and epidemics.
The Unlicense

Double Backslash problem in Windows - Jupyter Notebook #92

Open ShweataNHegde opened 4 years ago

ShweataNHegde commented 4 years ago

Hello, I tried running text.ipynb (https://github.com/petermr/ami3/blob/master/src/ipynb/text.ipynb) on Windows 10 Home, but the file paths that get printed contain double backslashes instead of single ones. Details below.

This is the cell I ran:

project = 'C:/Users/shweata/ami3/src/test/resources/org/contentmine/ami/zika10'
# os.chdir(project)
file_glob = 'PMC*'
files = get_globbed_files(project, file_glob)
print("number of " + file_glob + " files: " + str(len(files)) + "\n " + str(files))
print("file type " + str(type(files)))
abstract_files = get_globbed_files(project, 'PMC*/sections/abstract/*.xml')
print("abstracts " + str(abstract_files))    
text_files = get_globbed_files(project, 'PMC*/sections/**/*.xml', recursive=False)
print("number of xml text files: " + str(len(text_files)) +"\n" + str(text_files))
figure_files = get_globbed_files(project, 'PMC*/sections/**/*figure*.xml', recursive=False)
# print("number of figure files: " + str(len(figure_files)) +"\n" + str(figure_files))

The output:

abstracts ['PMC3113902\\sections\\abstract\\elem_0.xml', 'PMC320490\\sections\\abstract\\background__4_0.xml', 'PMC3289602\\sections\\abstract\\author_summary_1.xml', 'PMC3289602\\sections\\abstract\\background__3_0.xml', 'PMC3310194\\sections\\abstract\\elem_0.xml', 'PMC3310457\\sections\\abstract\\elem_0.xml', 'PMC3310457\\sections\\abstract\\elem_1.xml', 'PMC3310660\\sections\\abstract\\elem_0.xml', 'PMC3321795\\sections\\abstract\\elem_0.xml', 'PMC3321797\\sections\\abstract\\elem_0.xml']
number of xml text files: 141
['PMC3113902\\sections\\2_back\\0_ack.xml', 'PMC3113902\\sections\\abstract\\elem_0.xml', 'PMC3113902\\sections\\article\\elem_0.xml', 'PMC3113902\\sections\\figures\\figure_1.xml', 'PMC3113902\\sections\\figures\\figure_2.xml', 'PMC320490\\sections\\2_back\\0_ack.xml', 'PMC320490\\sections\\3_floats-group\\0_figure_1.xml', 'PMC320490\\sections\\3_floats-group\\1_figure_2.xml', 'PMC320490\\sections\\3_floats-group\\2_table_1.xml', 'PMC320490\\sections\\3_floats-group\\3_table_2.xml', 'PMC320490\\sections\\3_floats-group\\4_figure_3.xml', 'PMC320490\\sections\\3_floats-group\\5_figure_4.xml', 'PMC320490\\sections\\3_floats-group\\6_figure_5.xml', 'PMC320490\\sections\\abstract\\background__4_0.xml', 'PMC320490\\sections\\article\\elem_0.xml', 'PMC320490\\sections\\figures\\figure_1.xml', 'PMC320490\\sections\\figures\\figure_2.xml', 'PMC320490\\sections\\figures\\figure_3.xml', 'PMC320490\\sections\\figures\\figure_4.xml', 'PMC320490\\sections\\figures\\figure_5.xml', 'PMC320490\\sections\\tables\\table_1.xml', 'PMC320490\\sections\\tables\\table_2.xml', 'PMC3289602\\sections\\0_introduction\\0_title.xml', 'PMC3289602\\sections\\0_introduction\\1_p.xml', 'PMC3289602\\sections\\0_introduction\\2_p.xml', 'PMC3289602\\sections\\0_introduction\\3_p.xml', 'PMC3289602\\sections\\0_introduction\\4_p.xml', 'PMC3289602\\sections\\0_introduction\\5_p.xml', 'PMC3289602\\sections\\1_methods\\0_title.xml', 'PMC3289602\\sections\\2_back\\0_fn-group.xml', 'PMC3289602\\sections\\2_results\\0_title.xml', 'PMC3289602\\sections\\3_discussion\\0_title.xml', 'PMC3289602\\sections\\3_floats-group\\0_table_1.xml', 'PMC3289602\\sections\\3_floats-group\\1_table_2.xml', 'PMC3289602\\sections\\3_floats-group\\2_figure_1.xml', 'PMC3289602\\sections\\3_floats-group\\3_table_3.xml', 'PMC3289602\\sections\\3_floats-group\\4_figure_2.xml', 'PMC3289602\\sections\\4_floats-group\\0_table_1.xml', 'PMC3289602\\sections\\4_floats-group\\1_table_2.xml', 
'PMC3289602\\sections\\4_floats-group\\2_figure_1.xml', 'PMC3289602\\sections\\4_floats-group\\3_table_3.xml', 'PMC3289602\\sections\\4_floats-group\\4_figure_2.xml', 'PMC3289602\\sections\\abstract\\author_summary_1.xml', 'PMC3289602\\sections\\abstract\\background__3_0.xml', 'PMC3289602\\sections\\acknowledge\\elem_0.xml', 'PMC3289602\\sections\\article\\elem_0.xml', 'PMC3289602\\sections\\figures\\figure_1.xml', 'PMC3289602\\sections\\figures\\figure_2.xml', 'PMC3289602\\sections\\methods\\methods__4_0.xml', 'PMC3289602\\sections\\tables\\table_1.xml', 'PMC3289602\\sections\\tables\\table_2.xml', 'PMC3289602\\sections\\tables\\table_3.xml', 'PMC3310194\\sections\\2_back\\0_ack.xml', 'PMC3310194\\sections\\2_back\\2_app-group.xml', 'PMC3310194\\sections\\3_floats-group\\0_table-wrap.xml', 'PMC3310194\\sections\\3_floats-group\\10_figure_10_.xml', 'PMC3310194\\sections\\3_floats-group\\11_figure_11_.xml', 'PMC3310194\\sections\\3_floats-group\\12_figure_12_.xml', 'PMC3310194\\sections\\3_floats-group\\13_supplementary-material.xml', 'PMC3310194\\sections\\3_floats-group\\14_supplementary-material.xml', 'PMC3310194\\sections\\3_floats-group\\15_supplementary-material.xml', 'PMC3310194\\sections\\3_floats-group\\16_supplementary-material.xml', 'PMC3310194\\sections\\3_floats-group\\17_supplementary-material.xml', 'PMC3310194\\sections\\3_floats-group\\18_supplementary-material.xml', 'PMC3310194\\sections\\3_floats-group\\19_supplementary-material.xml', 'PMC3310194\\sections\\3_floats-group\\1_figure_1_.xml', 'PMC3310194\\sections\\3_floats-group\\20_supplementary-material.xml', 'PMC3310194\\sections\\3_floats-group\\21_appendix_figure_1_.xml', 'PMC3310194\\sections\\3_floats-group\\22_appendix_figure_2_.xml', 'PMC3310194\\sections\\3_floats-group\\23_appendix_figure_3_.xml', 'PMC3310194\\sections\\3_floats-group\\2_figure_2_.xml', 'PMC3310194\\sections\\3_floats-group\\3_figure_3_.xml', 'PMC3310194\\sections\\3_floats-group\\4_figure_4_.xml', 
'PMC3310194\\sections\\3_floats-group\\5_figure_5_.xml', 'PMC3310194\\sections\\3_floats-group\\6_figure_6_.xml', 'PMC3310194\\sections\\3_floats-group\\7_figure_7_.xml', 'PMC3310194\\sections\\3_floats-group\\8_figure_8_.xml', 'PMC3310194\\sections\\3_floats-group\\9_figure_9_.xml',

(Truncated)
When I try running the subsequent cell,

text_contents = []
for text_file in text_files:
    text_filex = open(text_file,mode='r')
    text = text_filex.read()
    text_filex.close()
    text_contents.append(text)
len(text_contents) 
# text_contents

I get the following error.

---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-4-43f8fb121a67> in <module>
      1 text_contents = []
      2 for text_file in text_files:
----> 3     text_filex = open(text_file,mode='r')
      4     text = text_filex.read()
      5     text_filex.close()
FileNotFoundError: [Errno 2] No such file or directory: 'PMC3113902\\sections\\2_back\\0_ack.xml'

I searched online for help and found this article (https://lerner.co.il/2018/07/24/avoiding-windows-backslash-problems-with-pythons-raw-strings/). I tried the solutions it suggests, but they didn't help.
I have very little programming experience, so any help with this would be appreciated.
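
A note on the doubled backslashes: they are most likely only a display artifact. Printing a list, and the FileNotFoundError message as well, shows each string's repr(), which escapes every backslash, so the paths themselves contain single backslashes. The small sketch below illustrates this; the variable name `path` is made up for the example. The more probable cause of the error is that the returned paths are relative to the project directory while `os.chdir(project)` is commented out, though that depends on how `get_globbed_files` builds its results, which is not shown in this thread.

```python
# A relative path as it might come back from a glob on Windows.
# The backslashes are doubled only because this is a string literal.
path = 'PMC3113902\\sections\\abstract\\elem_0.xml'

# str() shows the real characters: one backslash per separator.
print(path)    # PMC3113902\sections\abstract\elem_0.xml

# Printing a list uses repr() on each element, which escapes backslashes.
print([path])  # ['PMC3113902\\sections\\abstract\\elem_0.xml']

# Count the actual backslash characters in the string and in its repr().
single = path.count('\\')
escaped = repr(path).count('\\')
print(single, escaped)  # 3 6
```

If that is what is happening, joining each result onto the project directory (for example with `os.path.join(project, text_file)`) before calling `open` should be enough to make the second cell work.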

petermr commented 4 years ago

I will ask on the Shuttleworth Slack.

petermr commented 4 years ago

I think I should be using Path... Just a guess at present.
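
Following up on the Path idea, here is a minimal sketch, assuming the goal is simply to read every matching XML file regardless of the notebook's working directory. `read_texts` is a hypothetical helper, not part of ami3 or the notebook, and the UTF-8 encoding is an assumption about the files. `Path.glob` yields paths that already include the project root, so they can be opened without `os.chdir`.

```python
from pathlib import Path


def read_texts(project_dir, pattern):
    """Return {Path: text} for files under project_dir matching pattern.

    Hypothetical helper for illustration; assumes the files are UTF-8.
    """
    root = Path(project_dir)
    contents = {}
    # Each yielded path is prefixed with root, so it resolves correctly
    # whatever the current working directory is.
    for path in sorted(root.glob(pattern)):
        if path.is_file():
            contents[path] = path.read_text(encoding="utf-8")
    return contents
```

With the project directory from the report this might be called as `read_texts(project, 'PMC*/sections/*/*.xml')`. Note that `Path.glob` always treats `**` as recursive, so `*/*.xml` is the closer equivalent of the original `**/*.xml` pattern with `recursive=False`.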