No worries; I'm working constantly on getting my laptop cleaned out and running again, so I appreciate that you're taking the initiative anyway. I'll work on the code with you as soon as my laptop is ready (which will hopefully be tomorrow).
As for reading in the data: assuming the input format of the data is the same (cc @LindberghLi), yes, you'll use `sc.textFile()`. By invoking this command, the `SparkContext` object (`sc`) distributes the data for you, so you don't need to call `parallelize()`. In fact, `textFile()` is what's called a "lazy" operation, meaning Spark won't actually read the text file until you perform an action on it (such as counting, collecting, or printing its contents).
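To make the laziness concrete, here's a minimal sketch (assuming the `sc` provided by the pyspark shell; the file name is made up for illustration):

```python
# textFile() is lazy: this line only records *how* to build the RDD.
S = sc.textFile('file_s.txt')

# count() is an action: only now does Spark actually read the file.
print(S.count())
```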
Can I change R1DL_Spark.py a little, as follows:
```python
import argparse

from pyspark import SparkContext, SparkConf


def main(sc):
    # TODO: the actual dictionary learning will go here.
    pass


if __name__ == "__main__":
    # Set up the arguments here.
    parser = argparse.ArgumentParser(description='PySpark Dictionary Learning',
        add_help='How to use', prog='python pyspark-example.py <args>')
    parser.add_argument("-i", "--input", required=True,
        help="Input file name (file_s).")
    parser.add_argument("-d", "--dictionary", required=True,
        help="Dictionary file name (file_D).")
    parser.add_argument("-o", "--output", required=True,
        help="Output file name (file_Z).")
    parser.add_argument("-n", "--pnonzero", type=float, required=True,
        help="Percentage of non-zero elements.")
    parser.add_argument("-m", "--mDicatom", type=int, required=True,
        help="Number of the dictionary atoms.")
    parser.add_argument("-e", "--epsilon", type=float, required=True,
        help="The value of epsilon.")
    args = vars(parser.parse_args())

    # Initialize the SparkContext. This is where you can create RDDs,
    # the Spark abstraction for distributed data sets.
    sc = SparkContext(conf=SparkConf())
    main(sc)
```
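(For reference, a hypothetical invocation of this skeleton, with made-up argument values, would be something like `spark-submit R1DL_Spark.py -i file_s.txt -d file_D.txt -o file_Z.txt -n 0.07 -m 100 -e 0.01`.)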
No, let's leave it as is for now.
As @MOJTABAFA requested, I have put the transpose of the 700 MB test file at:
http://hafni.cs.uga.edu/test_T.txt
The "4.5 million" fMRI data is right now distributed as 68 individual files, I can merge them as a single large file if needed, or we can write a small interface to load them in batch.
Thanks Xiang, I'll test it and let you know soon.
Actually, Spark's `textFile()` method will work on a directory of text files: it reads every file in the directory into a single RDD (and also accepts wildcards and comma-separated paths).
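For example, a sketch of loading the 68 files at once (assuming they live in a hypothetical directory `fmri_data/`):

```python
# One RDD over every text file in the directory; each element is one line.
S = sc.textFile('fmri_data/')

# Globs and comma-separated paths are also accepted, e.g.:
# S = sc.textFile('fmri_data/*.txt')
```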
@magsol Starting out with Spark, I'm sorry if my questions are too simple; I'm a complete beginner with Spark and need your support. Can we start by importing a text file and making an RDD with a command like the following:
```python
S = sc.textFile('../../file_s.txt')
```
Am I right? And is it necessary to call `sc.parallelize()` at the beginning?
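(For contrast, a minimal sketch of where `parallelize()` does come in: it turns a local Python collection that already lives in the driver into an RDD, which isn't needed when reading from a file with `textFile()`:

```python
local_data = [1, 2, 3, 4]         # an ordinary in-memory list
rdd = sc.parallelize(local_data)  # distribute it as an RDD
print(rdd.sum())                  # 10
```
)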